Search This Blog

Monday, November 26, 2012

SCOM Exchange Management Pack Pitfall

Well this was certainly a fun one to track down. Again, being my week on call and it being a holiday, there was no rest for the wicked apparently. The problems actually surfaced earlier in the week but then reared their head again on Saturday. If you get into this situation, there will be what seems to be random issues with exchange, queues shutting down, mailboxes getting disconnected. All sorts of weird stuff. I'll explain more after the errors. Here are some of the error messages we started receiving in the email queue and the exchange event logs.

***************************************************
Alert: The database copy is very low on log volume space. The volume has reached critical levels.

Source: Database Copy (log) Logical Disk Space (D:\DB1) - <server> (Mailbox) -
 
Path: <server>; <server>(Mailbox) - Last modified by: System Last modified time: 11/24/2012 10:33:24 AM Alert description: TimeSampled: 2012-11-24T10:32:45.0000000-08:00

ObjectName: LogicalDisk

CounterName: % Free Space

InstanceName: D:\DB1

Value: 4

SampleValue: 11.9463672637939


***************************************************
Log Name:      Microsoft-Exchange-Troubleshooters/Operational
Source:        Database Space
Date:          11/23/2012 4:27:32 PM
Event ID:      5701
Task Category: (1)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:     <servername>
Description:
The database space troubleshooter detected a low space condition on volume D:\DB1\ for database DB1. Provisioning for this database has been disabled. Database is under 16% free space.
***************************************************

Now let's pull the curtains back a bit and find out what's going on here. There are several things involved. First, Exchange 2010 has a powershell script called troubleshoot-databasespace.ps1. Without SCOM, this script would be called manually. The Exchange management pack, however, calls it automatically. More details can be found here - http://letsexchange.blogspot.com/2012/09/exchange-2010-sp1-added-new-script.html

Troubleshoot-databasespace.ps1 refers to a file that has the limits set to gauge the database health. Here are the default constants in the file, StoreTSConstants.ps1:

There were found in the \Exchange14\Scripts folder

# The percentage of disk space for the EDB file at which we should start quarantining users.
$PercentEdbFreeSpaceDefaultThreshold = 25
# The percentage of disk space for the logs at which we should start quarantining users.
$PercentLogFreeSpaceDefaultThreshold = 25
# The percentage of disk space for the EDB file at which we are at alert levels.
$PercentEdbFreeSpaceAlertThreshold = 16
# The percentage of disk space for the EDB file at which we are at critical levels.
$PercentEdbFreeSpaceCriticalThreshold = 8

So, in our case, we have a 600 Gb lun and were down to roughly 12% of our space, falling below the alert levels set by default, but still had 72Gb of storage left. So exchange went into alert mode. We started receiving issues of users not being able to connect to exchange or unable to send messages but still able to receive them. Very strange stuff. I changed the values to start alerting at 5% and then put the servers in maintenance mode to get over the immediate issue.

So, don't get caught in the Exchange Management Pack trap. Ensure you set these levels to something that makes sense. Unlike the SCOM disk alerts that have a two factor calculation mechanisms, this is the old-fashioned percentage calculation, making it very easy to get bit on large volumes. The other option is that you turn off these monitors. Given the space monitoring is somewhat redundant with the SCOM disk space alerts, that may be a safer alternative.








Friday, November 23, 2012

Using Monitors for Automated Disk Space Recovery

Thought I would do a new post today. It's the day after Thanksgiving and guess who drew the after-hours phone a monitoring ticket this week? So, bunch of disk space monitors came in and I decided I didn't want to have to respond to them, especially since I will have after-hours duties during Christmas as well, lucky me.

So the concept I'll illustrate here is to have SCOM take automated actions to clean-up drive space on its own, hopefully averting an impending disaster and the need for you to get up from your peaceful slumber, camping trip or whatever it is you do when you try to have a life outside of daily IT tasks.

As with any Microsoft product, there are about a dozen ways to do this. I give you probably the simplest way, but long term would probably be difficult to maintain in a large server environment. I'll improve on this and post any updates when I do.

First, let us start by creating a folder on the root drive of a target server. I called the folder "C:\Scripts". Create a batch file in the folder. I called mine "cleanup.bat"

System Center 2012 SCOM Disk Cleanup Script Folder


I have included the contents of the sample script. It is pretty basic, essentially clearing out temporary files in a variety of locations. This could certainly be expanded to remove old IIS or Blackberry log files, remove SQL backups, etc. Any action that can be scripted can essentially be run here.


del C:\Temp\*.* /s /q
del %Windir%\Temp /s /q
IF EXIST "C:\Users\" (
    for /D %%x in ("C:\Users\*") do (
        del /f /s /q "%%x\AppData\Local\Temp\"
        del /f /s /q "%%x\AppData\Local\Microsoft\Windows\Temporary Internet Files\"
    )
)
IF EXIST "C:\Documents and Settings\" (
    for /D %%x in ("C:\Documents and Settings\*") do (
        del /f /s /q "%%x\Local Settings\Temp\"
        del /f /s /q "%%x\Local Settings\Temporary Internet Files\"
    )
)
With the script loaded on the target server, go to your SCOM management console and navigate to the authoring tab and then select the "Monitors" option.



After the monitors all load, in th search field, type in "Logical Disk" to narrow the monitors. Once the search finishes, you can then select one of the Logical Disk Free Space monitors, such as Windows Server 2000, Windows Server 2003 or Windows Server 2008. I chose to alter the settings for the Windows Server 2008 Logical Disk Free Space monitor. Right-click on the Logical Disk Free Space monitor and select "Properties".



Navigate to the "Diagnostic and Recovery" tab. Here, we will want to add a recovery task. You can put in the same task for both warning and critical recovery tasks if you like.

In the "Recovery Task Type", select "Run Command" and change the destination management pack to either the default client overrides or an alternate management pack you might have for such customizations.

On the next section, give the recovery task a name that makes sense. You can choose to have the system recalculate the monitor. If disk space falls back into the norm, the alert will clear itself automatically.

On the last screen, enter the path on the target server where the script was located. As I outlined in the beginning, I put the scripts in the "C:\Scripts\" folder with a file called cleanup.bat.

You are basically done at this point. When the alert or warning condition occurs (which ever you setup to respond to), the script will kick off and delete the files.

The nice thing about running a recovery task in this manner is that if the alert persists, you know there is a larger problem as the basic steps of clearing up misc., temporary files has already been accomplished. Now you know there is something out of the ordinary going on.

You can now test this script by running a program called Philip 2.10. This will create a large file in one of the temporary directories cleared by the sample script. You can download the software from the following location:

http://www.softpedia.com/get/System/File-Management/Philip.shtml