
Thursday, January 8, 2015

System Center Custom Application Monitoring

Sometimes, a problem can seem really hard, but turn out to be rather simple if approached from a different perspective. The application team I support has a poorly written application running on a server. This should sound familiar to most system admins out there.

The application crashed the other day, but none of its Windows services actually stopped, and there were no event log errors to go on either. How do we monitor that application, then? One option might be the TCP port, but nobody seemed to know what it was. Digging into the application, we found it had a small scripting engine that allowed us to run some basic scripts.

The application team's first thought was to write an event to the log saying everything is OK, and to alert when that event doesn't appear. SCOM can monitor for missing events, but that seemed error-prone.

What we settled on instead was to have the program simply drop a file in the temp directory. It would put the file there every 30 minutes, with the same name. So now what? I created a small and simple batch file that would check for the file, then delete it if it was there. Otherwise, it would report the file missing and the service stopped.

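@ECHO OFF
REM Heartbeat file present: skip the alert and clear the file for the next cycle.
IF EXIST C:\TEMP\running.log GOTO Good
REM Heartbeat file missing: write an error event for SCOM to alert on.
EVENTCREATE /T ERROR /ID 333 /L APPLICATION /D "Custom Application Failed"
GOTO :EOF
:Good
DEL /Q C:\TEMP\running.log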

I then set up a scheduled task to run this batch file every 30 minutes. When the file went missing, the script would write the error to the event log. From there, just set up an event monitor in System Center to catch and alert on the event.
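The check-and-consume logic of the batch file can be sketched in Python as well. This is only an illustration of the idea, not what we deployed: the function name check_heartbeat is made up for the example, and the demo runs against a throwaway temp directory rather than C:\TEMP.

```python
import tempfile
from pathlib import Path


def check_heartbeat(marker: Path) -> bool:
    """Return True if the heartbeat file exists (and consume it), else False.

    check_heartbeat is a hypothetical name used only for this sketch.
    """
    if marker.exists():
        # Delete the file so the next run only passes if the
        # application has written a fresh one in the meantime.
        marker.unlink()
        return True
    return False


if __name__ == "__main__":
    # Demo against a temporary directory (an assumption for the example).
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "running.log"
        marker.write_text("ok")
        print(check_heartbeat(marker))  # True: file was present
        print(check_heartbeat(marker))  # False: consumed and not rewritten
```

A False result is the point where the batch file calls EVENTCREATE; the Python version leaves that reporting step to the caller.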

Operations Manager Performance Trending with Excel - No SQL Required

A powerful tool for administrators is to trend data to troubleshoot performance problems and forecast future resource needs. In the past, I've run SQL queries but needed a way to instruct support staff on a basic means to accomplish the same tasks right from the console. Thankfully, Operations Manager and Excel allow just that.

Let's get started.

Fire up the Operations Manager Console, go to the Monitoring section, then open the Windows Computers view (or any view where you can access computer health, such as SQL).


Find a server you're interested in or simply select one from the list.

 
After selecting the system, select the Performance View under the Navigation pane of the task panel on the right side of the management console.
 
 
Once the Performance View window comes up, in the Performance Actions pane on the right side of the console, change your time frame via Select Time Range, selecting a meaningful period, such as two weeks or longer.
 
 
 
 
Now, at the bottom of the performance monitor screen, select a counter you're interested in. I'll use Percent Memory Used for this example.


When selected, a graph should display such as the following:
 
  
Going back to the Performance Actions pane, select Copy Data to Clipboard

Open up Notepad and paste the contents, which should look similar to the following:


Save the file with an .xml extension.
 
 
  
Now open Excel, select the Data tab and select From Other Sources -> From XML Data Import
 

Select the XML file created earlier; accept the import defaults when prompted
 

This should populate the Excel spreadsheet with an X and a Y column. The X column is the date/time stamp and the Y column is the performance data.
 
 
Select all of the Y data and with it highlighted, select the Insert tab -> Line -> 2-D Line to generate a graph.
 

This should yield a graph in Excel such as the following:
 

Right-click on the graph line and select Add Trendline
 

Generally, accepting the default will draw a trendline that is helpful for spotting issues such as a memory leak or steadily growing disk usage. However, you can play around with the trendline options to produce longer-term forecasts. I extended my trendline forward by 50 periods to see how memory might look beyond the end of my data set.
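For anyone curious what a straight-line trendline is doing under the hood, here is a minimal Python sketch of a least-squares fit with a forward forecast. It assumes evenly spaced samples and a linear trend; the function names and the sample values are made up for illustration and do not come from SCOM or Excel.

```python
def linear_trend(values):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, periods_ahead):
    """Extend the fitted line periods_ahead samples past the last data point."""
    slope, intercept = linear_trend(values)
    x = len(values) - 1 + periods_ahead
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical Percent Memory Used samples, rising 0.5 per period.
    samples = [40.0, 40.5, 41.0, 41.5]
    slope, intercept = linear_trend(samples)
    print(slope)                  # 0.5
    print(forecast(samples, 50))  # 66.5
```

A slope near zero corresponds to the flat line described for a healthy server, while a consistently positive slope is the pattern a leak would produce.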
 
 
Graph results with the trendline:
 
 
The line extends a bit beyond the graph data and shows an overall flat trend on memory utilization. If the server had a memory leak, as an example, the graph might trend steadily upwards like this:

 
There you have it, a simple but powerful tool for analyzing data recorded in SCOM without a lot of effort.

Tuesday, December 23, 2014

AD Site Availability Degraded / AD Site Performance Health Degraded

After deploying the Active Directory Management Packs, we had a domain controller start spewing alerts. I had not come across anything out there that really dealt with the alert, and the warning from this type of event was not on eventid.net either. But it's all figured out now, and here is the solution to the perplexing problem I encountered.

You could also title this, "How to Perform an Online/Offline Defragmentation of your Health Service Store in System Center".

Problem Description:

First, the SCOM console began to fill up with AD Site Availability Health Degraded and AD Site Performance Health Degraded critical alerts from the Active Directory Management Packs.


On the offending domain controller, I saw the Application event log filling up with the following warning:
 

 
The contents of the warning were as follows:
HealthService (1704) A significant portion of the database buffer cache has been written out to the system paging file. This may result in severe performance degradation. See help link for complete details of possible causes.
Log Name: Application | Source: ESENT | Event ID: 906

Troubleshooting:

Initially, I suspected that an application or process was going bonkers on the server, taking up memory and causing the SCOM agent to malfunction or be starved of resources. I loaded the Sysinternals Process Monitor utility to see what was happening when these events fired, since typically only a few minutes passed between events. What it captured was a significant amount of file activity from the Health Service to C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Health Service Store\HealthServiceStore.edb. Essentially, no other process active at the time of these warnings, or of the corresponding alerts in the System Center Management Console, could account for issues on the system.
 
 
SCOM HealthService | ReadFile | C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Health Service Store\HealthServiceStore.edb
 
With the smoking gun being the Health Service Database, I performed some quick online maintenance from within the console to start.
 
In the Operations Manager Console, I started by browsing to the Operations Manager folder, then Agent Details and selecting the Agents by Version view.
Management Console Tree -> Operations Manager -> Agent Details -> Agents By Version
 
 
Selecting the offending computer brought up the Health Service Tasks I could perform; Start Online Store Maintenance was the one I was looking for.
Management Console Health Service Task for Health Service Database Maintenance | Start Online Store Maintenance
 
Final Solution:

Unfortunately, the online store maintenance was not enough to remediate the errors and warnings I was encountering, so I opted for an offline defragmentation of the Health Service Store database. Perform the following if local warnings persist on the client system.
 
  • Log in to the offending client system via console or RDP
  • Open an administrative command prompt
  • Change directory to "C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Health Service Store"
  • From the services console (services.msc) or from the command prompt (net stop "Microsoft Monitoring Agent"), stop the Microsoft Monitoring Agent service
  • Run esentutl /r edb (without this, you likely won't be able to perform a defragmentation)
  • Next, run esentutl /d HealthServiceStore.edb to compact and defragment the Health Service database
  • Start the Microsoft Monitoring Agent service again (for example, net start "Microsoft Monitoring Agent")

When this completed, my HealthServiceStore.edb file had gone from 174 MB to 27 MB, and both the warnings in the local Application event log and the critical health alerts in the System Center Operations Manager Console went away.
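If you have to repeat this on several agents, the command sequence can be captured in a small script. The following Python sketch is only an illustration: it defaults to a dry run that prints the commands, the function names are invented for the example, and the final net start step is my assumption about how you would bring the agent back up.

```python
import subprocess

# Path and service name as used in the steps above.
STORE_DIR = (
    r"C:\Program Files\Microsoft Monitoring Agent\Agent"
    r"\Health Service State\Health Service Store"
)
SERVICE = "Microsoft Monitoring Agent"


def defrag_plan():
    """Return the commands in the order they should run."""
    return [
        ["net", "stop", SERVICE],
        ["esentutl", "/r", "edb"],                     # recovery first
        ["esentutl", "/d", "HealthServiceStore.edb"],  # offline defragmentation
        ["net", "start", SERVICE],                     # assumed restart step
    ]


def run_plan(dry_run=True):
    """Print each command; execute it from STORE_DIR only when dry_run is False."""
    for cmd in defrag_plan():
        print(cmd)
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, cwd=STORE_DIR, check=True)


if __name__ == "__main__":
    run_plan(dry_run=True)
```

Running it with dry_run=False would need an elevated prompt on the affected system, the same as the manual steps.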

Wednesday, August 6, 2014

Problems with the 2012 R2 Web Consoles

This post is a little long, but I wanted to include as much pertinent error information as possible to help folks properly identify if they are encountering the same type of issue.

We recently upgraded our systems to SCOM 2012 R2 and encountered some issues with client connectivity to the web console. For perspective on how our systems are distributed: SQL is on a separate system from the management server, while the web console and the management server are on the same system.

First, let's start with some of the errors I was seeing:

From a client, attempting to connect to the AppAdvisor console:

Error on the client:

An error has occurred - The additional error information can be found in the Windows Application Log. We apologize for any inconvenience caused by this temporary service outage.


Warning on the SCOM management server when connecting to the AppAdvisor console:

Event code: 3005
Event message: An unhandled exception has occurred.
Event time: 8/5/2014 9:38:10 AM
Event time (UTC): 8/5/2014 4:38:10 PM
Event ID: 20964fc40f3c43348ccff13e467e259a
Event sequence: 7
Event occurrence: 1
Event detail code: 0

Application information:
Application domain: /LM/W3SVC/1/ROOT/AppAdvisor-1-130517302775480349
Trust level: Full
Application Virtual Path: /AppAdvisor
Application Path: C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\WebConsole\AppDiagnostics\AppAdvisor\Web\
Machine name: SCOM-MS01

Process information:
Process ID: 4332
Process name: w3wp.exe
Account name: NT AUTHORITY\NETWORK SERVICE

Exception information:
Exception type: WebException
Exception message: The request failed with HTTP status 401: Unauthorized.

Request information:
Request URL: http://scom-ms01/AppAdvisor/Pages/ReportService/ReportServicePageImpl.aspx?_r=&_c=g&_pg=436ac5a4-3e70-41b9-9fe1-5a5c96724dc0&_s=2C369460
Request path: /AppAdvisor/Pages/ReportService/ReportServicePageImpl.aspx
User host address:
User:
Is authenticated: True
Authentication Type: Forms
Thread account name: NT AUTHORITY\NETWORK SERVICE

Thread information:
Thread ID: 17
Thread account name: NT AUTHORITY\NETWORK SERVICE
Is impersonating: False

Similarly, I received an error when connecting to the AppDiagnostics site as well:

Event code: 3005
Event message: An unhandled exception has occurred.
Event time: 8/5/2014 9:32:02 AM
Event time (UTC): 8/5/2014 4:32:02 PM
Event ID: 67e2d2ba9c4842c3bc041c62bad932e3
Event sequence: 8
Event occurrence: 1
Event detail code: 0
Application information:
Application domain: /LM/W3SVC/1/ROOT/AppDiagnostics-2-130517299136496487
Trust level: Full
Application Virtual Path: /AppDiagnostics
Application Path: C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\WebConsole\AppDiagnostics\Web\
Machine name: SCOM-MS01

Process information:
Process ID: 8048
Process name: w3wp.exe
Account name: IIS APPPOOL\OperationsManagerAppMonitoring

Exception information:
Exception type: OleDbCommandException
Exception message: Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.
Command text: Select CONFIGID, CONFIGNAME, CONFIGVALUE From apm.CONFIG
Connection: Provider=SQLOLEDB;Server=scom-sql;database=OperationsManager;Integrated Security=SSPI;

Request information:
Request URL: http://scom-ms01/AppDiagnostics/Pages/Authenticate.aspx?ReturnUrl=/appdiagnostics
Request path: /AppDiagnostics/Pages/Authenticate.aspx
User host address:
User:
Is authenticated: False
Authentication Type:
Thread account name: IIS APPPOOL\OperationsManagerAppMonitoring

Thread information:
Thread ID: 9
Thread account name: IIS APPPOOL\OperationsManagerAppMonitoring
Is impersonating: False

And finally, on the primary /OperationsManager web console, I'd receive an authentication error. The client would be prompted multiple times for a username and password and eventually bomb out.

 
Server Error - 401 - Unauthorized: Access is denied due to invalid credentials. You do not have permission to view this directory or page using the credentials that you supplied.
 
Solving the problem.

The first step was a prerequisite for both the AppAdvisor and AppDiagnostics issues.
  1. Open the IIS console on the web console server
  2. Select "Application Pools"
  3. Select "OperationsManagerAppMonitoring"
  4. If you are receiving the errors and the application pool "Identity" is set to "ApplicationPoolIdentity", keep the OperationsManagerAppMonitoring pool highlighted and select the "Advanced Settings" option in the action pane.
  5. Under "Process Model", change the Identity from ApplicationPoolIdentity to "NetworkService"
  6. Run an IISReset at an administrator (elevated) command prompt
At this point, the AppDiagnostics website started working, but the AppAdvisor site did not. I had to perform additional steps for that site.
  1. Open the IIS console on the web console server
  2. Select and expand the site (Default Web Site on my server) where the Operations Manager web console is installed.
  3. Select the virtual directory named "AppAdvisor"
  4. Open the "Authentication" applet
  5. If not already enabled, enable the "Anonymous Authentication" and "ASP.NET Impersonation" methods
  6. Run an IISReset at an administrator (elevated) command prompt
The final piece to get into the Operations Manager web console was to adjust an IE setting, oddly enough. To fix this portion, I took the following steps:
  1. Open "Internet Options" in Internet Explorer
  2. Select the "Advanced" tab
  3. Scroll almost all the way down and uncheck the box for "Enable Integrated Windows Authentication"
After these adjustments, all web consoles were available for remote clients.

Friday, February 7, 2014

SCOM 2012 Failed Accessing Windows Event Log with Veeam Management Pack

Noticed during a routine health check that our two management servers were showing a warning state. The error read "Failed Accessing Windows Event Log" <management server 1> (Health Service).

Error details show the following:

The Windows Event Log Provider is still unable to open the Veeam Collector event log on computer 'management server 1'. The Provider has been unable to open the Veeam Collector event log for 720 seconds. Most recent error details: The specified channel could not be found. Check channel configuration. One or more workflows were affected by this. Workflow name: many Instance name: many Instance ID: many Management group:

We have the Veeam management pack for SCOM loaded and sure enough, this appears to be a documented issue on the Veeam knowledge base.

http://www.veeam.com/kb1496#/kb1496

Thursday, November 21, 2013

Windows 2012 WMI Hotfix

Had a 2012 server that was being monitored by System Center lock up on us today. Suspect a WMI leak. Hotfix deployment, engage!

http://support.microsoft.com/kb/2790831/en-us

Friday, November 15, 2013

SCOM 2012 Powershell - Retrieving a List of Computers in a Group

Had to search for a batch file that is on one of the many SQL servers we have in the environment. First inclination was, let me pull the systems from SCOM since it has all our SQL servers.

Poked around the interwebs a while and noticed a lot of scripts referenced 2007 commands that hadn't been updated for 2012. Here are the basic steps I took to get my group of SQL servers. You could perform the same task on pretty much any group in the same manner.

  • Open the Operations Manager Shell PowerShell console
  • Type in: Get-SCOMGroup
  • Search the output for the group you want to retrieve members from
  • Now type in: $Group = Get-SCOMGroup | where {$_.DisplayName -eq "SQL Computers"} (or substitute the group you're looking for in place of "SQL Computers")
  • Next, type in: $Members = $Group.GetRelatedMonitoringObjects()
  • Now, you can simply type: $Members (the output should show three headings followed by the server members of the group)
  • Or, pipe the command out to a file: $Members | Sort DisplayName | FT DisplayName | out-file C:\Scripts\Servers.txt
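Because the file is written through FT (Format-Table), it contains a DisplayName heading, a dashed rule, padding and blank lines in addition to the server names. If you want to feed the list into another script, a small cleanup pass helps. The Python sketch below assumes the file uses UTF-16, which is the Out-File default in Windows PowerShell; the function name and the server names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def read_server_list(path, encoding="utf-16"):
    """Return the server names from a Format-Table style text file.

    Skips blank lines, the 'DisplayName' heading and the dashed rule.
    The utf-16 default is an assumption based on Windows PowerShell's Out-File.
    """
    names = []
    for line in Path(path).read_text(encoding=encoding).splitlines():
        text = line.strip()
        if not text or text == "DisplayName" or set(text) == {"-"}:
            continue
        names.append(text)
    return names


if __name__ == "__main__":
    # Build a small sample file with hypothetical server names.
    sample = "\nDisplayName\n-----------\nSQL01.example.local   \nSQL02.example.local   \n\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Servers.txt"
        path.write_text(sample, encoding="utf-16")
        for name in read_server_list(path):
            print(name)
```

If your file was saved with a different encoding, pass it through the encoding argument.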