Sometimes, a problem can seem really hard, but turn out to be rather simple if approached from a different perspective. The application team I support has a poorly written application running on a server. This should sound familiar to most system admins out there.
The application crashed the other day, but none of the Windows services actually stopped, and there were no event log errors to go on either. How do we monitor that service, then? One option might be the TCP port, but nobody seemed to know what that was. Digging into the application, we found it had a small scripting engine that allowed us to run some basic scripts.
The first thought the application team had was: we'll write an event log entry saying everything is OK, and when that doesn't appear, we want an alert. Well, we can monitor for missing events in SCOM, but that seemed destined for error.
What we settled on instead was to have the program simply drop a file in the temp directory. It would put the file there every 30 minutes, with the same name. So now what? I created a small and simple batch file that would check for the file, then delete it if it was there. Otherwise, it would report the file missing and the service stopped.
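REM Heartbeat present: the application is alive, so skip the error.
IF EXIST C:\TEMP\running.log GOTO Good
REM Heartbeat missing: write an error event for SCOM to alert on.
EVENTCREATE /T ERROR /ID 333 /L APPLICATION /D "Custom Application Failed"
GOTO :EOF
:Good
REM Delete the file so the next run proves the application wrote a fresh one.
DEL /Q C:\TEMP\running.log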
I then set up a scheduled task to run this batch file every 30 minutes. When the file went missing, it would write the error to the event log. From there, just set up an event monitor in System Center to catch and alert on the event.
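The same check-and-consume idea can be sketched outside of batch. The following is a minimal Python illustration, not the script used above; the names check_heartbeat and demo are hypothetical, and writing the event log entry is left to the caller (for example by invoking EVENTCREATE when the function returns False).

```python
import tempfile
from pathlib import Path


def check_heartbeat(heartbeat: Path) -> bool:
    """Consume the heartbeat file.

    Returns True if the file was there (and is now deleted), False if it
    was missing. Deleting inside try/except avoids a gap between an
    "exists" test and the delete.
    """
    try:
        heartbeat.unlink()
        return True
    except FileNotFoundError:
        return False


def demo() -> list:
    """Simulate two scheduled runs with only one heartbeat written."""
    with tempfile.TemporaryDirectory() as tmp:
        heartbeat = Path(tmp) / "running.log"
        heartbeat.write_text("ok")            # the application checks in once
        first = check_heartbeat(heartbeat)    # file found and removed
        second = check_heartbeat(heartbeat)   # no new file: treat as failed
        return [first, second]


if __name__ == "__main__":
    print(demo())
```

The second run reports a failure because nothing rewrote the file in between, which is exactly the condition the scheduled task turns into an event.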
Thursday, January 8, 2015
Operations Manager Performance Trending with Excel - No SQL Required
A powerful tool for administrators is to trend data to troubleshoot performance problems and forecast future resource needs. In the past, I've run SQL queries but needed a way to instruct support staff on a basic means to accomplish the same tasks right from the console. Thankfully, Operations Manager and Excel allow just that.
Let's get started.
Fire up the Operations Manager Console to the Monitoring section, then open the Windows Computers view (or any section where you access the computer health view, such as SQL)
Find a server you're interested in or simply select one from the list.
After selecting the system, select the Performance View under the Navigation pane of the task panel on the right side of the management console.
Once the Performance View window comes up, in the Performance Actions pane on the right side of the console, change your time frame via Select Time Range, selecting a meaningful period, such as two weeks or longer.
Now, at the bottom of the performance monitor screen, select a counter you're interested in. I'll use Percent Memory Used for this example.
When selected, a graph should display such as the following:
Going back to the Performance Actions pane, select Copy Data to Clipboard
Open Notepad and paste the contents, which should look similar to the following:
Save the file with an xml extension
Now open Excel, select the Data tab and select From Other Sources -> From XML Data Import
Select the XML file created earlier; accept the import defaults when prompted
This should populate the Excel spreadsheet with X and Y columns. The X column is the date/time stamp and the Y column is the performance data.
Select all of the Y data and with it highlighted, select the Insert tab -> Line -> 2-D Line to generate a graph.
This should yield a graph in Excel such as the following:
Right-click on the graph line and select Add Trendline
Generally, accepting the default will paint a trendline that is helpful for finding issues such as a memory leak or consistent data usage on a hard drive. However, you can play around with the trend to perform longer-term forecasts. I added 50 periods to the end of my trend line to see how memory might look in the future after my data set.
Graph results with the trendline:
The line extends a bit beyond the graph data and shows an overall flat trend on memory utilization. If the server had a memory leak, as an example, the graph might trend steadily upwards like this:
There you have it, a simple but powerful tool for analyzing data recorded in SCOM without a lot of effort.
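For anyone curious what the trendline is doing, Excel's default linear trendline is a least-squares straight line through the Y values, and forecasting forward simply extends that line. Here is a minimal Python sketch of the same arithmetic; the function names and the sample values are made up for illustration and are not data from SCOM.

```python
from typing import List, Tuple


def linear_trend(ys: List[float]) -> Tuple[float, float]:
    """Least-squares line through points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(ys: List[float], periods: int) -> List[float]:
    """Extend the fitted line `periods` samples past the end of the data."""
    slope, intercept = linear_trend(ys)
    n = len(ys)
    return [intercept + slope * (n + i) for i in range(periods)]


if __name__ == "__main__":
    leaking = [40.0, 42.0, 44.0, 46.0]   # hypothetical memory-used samples
    print(linear_trend(leaking))
    print(forecast(leaking, 2))
```

A slope near zero corresponds to the flat memory trend described above, while a steadily positive slope is what a leak would look like.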
Tuesday, December 23, 2014
AD Site Availability Degraded / AD Site Performance Health Degraded
After deploying the Active Directory Management Packs, we had a domain controller start spewing alerts. I had not come across anything out there that really dealt with the alert, and the warning from this type of event was not on eventid.net either. But it's all figured out now, and here is the solution to the perplexing problem I encountered.
You could also title this, "How to Perform an Online/Offline Defragmentation of your Health Service Store in System Center".
Problem Description:
First, the SCOM console began to fill up with AD Site Availability Health Degraded and AD Site Performance Health Degraded critical alerts from the Active Directory Management Packs.
On the offending domain controller, I observed the following Application event log spewing:
The contents of the warning were as follows:
Troubleshooting:
Initially, I suspected that an application or process was going bonkers on the server, taking up memory and causing the SCOM agent to malfunction or be starved of resources. I loaded the Sysinternals Process Monitor utility to see what was happening when these events fired off, since typically only a few minutes passed between events. What was captured was a significant amount of file activity from the Health Service to
C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Health Service Store\HealthServiceStore.edb. Essentially, there was no other process at the time of these warnings or corresponding alerts in the System Center Management Console that could account for issues on the system.
With the smoking gun being the Health Service Database, I performed some quick online maintenance from within the console to start.
In the Operations Manager Console, I started by browsing to the Operations Manager folder, then Agent Details and selecting the Agents by Version view.
Selecting the offending computer brought up the Health Service Tasks I could perform, with Start Online Store Maintenance being the one I was looking for.
Final Solution:
Unfortunately, the online store maintenance was not enough to remediate the errors and warnings I was encountering, so I opted for an offline defragmentation of the Health Service Store database. Perform the following if local warnings persist on the client system.
- Log in to the offending client system via console or RDP
- Open an administrative command prompt
- Change directory to "C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Health Service Store"
- From the service console (services.msc) or from command prompt (net stop "Microsoft Monitoring Agent"), stop the Microsoft Monitoring Agent service
- Run esentutl /r edb (without this, you likely won't be able to perform a defragmentation)
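- Next, run esentutl /d HealthServiceStore.edb
- Start the Microsoft Monitoring Agent service again (net start "Microsoft Monitoring Agent") so monitoring resumes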
When this completed, my HealthServiceStore.edb file went from 174 MB to 27 MB, and both the warnings in the local Application event log and the critical health alerts in the System Center Operations Manager Console went away.
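If you need to repeat this on several agents, the sequence can be wrapped in a small script. The following is a rough Python sketch under stated assumptions, not a tested maintenance tool: the function names are hypothetical, the final net start step is my addition to bring the agent back up, and it defaults to a dry run that only returns the commands it would execute.

```python
import subprocess
from pathlib import Path
from typing import List

# Path and service display name as used in the steps above.
STORE_DIR = Path(
    r"C:\Program Files\Microsoft Monitoring Agent\Agent"
    r"\Health Service State\Health Service Store"
)
SERVICE = "Microsoft Monitoring Agent"


def defrag_commands(service: str = SERVICE) -> List[List[str]]:
    """Commands for the offline defragmentation, in order."""
    return [
        ["net", "stop", service],                    # stop the agent first
        ["esentutl", "/r", "edb"],                   # recovery pass
        ["esentutl", "/d", "HealthServiceStore.edb"],  # offline defrag
        ["net", "start", service],                   # assumption: restart agent
    ]


def run_offline_defrag(store_dir: Path = STORE_DIR,
                       dry_run: bool = True) -> List[List[str]]:
    """Run (or, by default, only list) the commands from the store folder."""
    planned = defrag_commands()
    for cmd in planned:
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, cwd=store_dir, check=True)
    return planned


if __name__ == "__main__":
    for cmd in run_offline_defrag(dry_run=True):
        print(" ".join(cmd))
```

Run it with dry_run=False only from an elevated prompt on the affected agent, and only after confirming the path matches your installation.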
Wednesday, August 6, 2014
Problems with the 2012 R2 Web Consoles
This post is a little long, but I wanted to include as much pertinent error information as possible to help folks properly identify if they are encountering the same type of issue.
We recently upgraded our systems to SCOM 2012 R2 and encountered some issues with client connectivity to the web console. For perspective on how our systems are distributed: SQL is on a separate system from the management console, while the web and management consoles are on the same system.
First, let's start with some of the errors I was seeing:
From a client, attempting to connect to the AppAdvisor console:
Error on the client:
Warning on the SCOM management server when connecting to the AppAdvisor console:
Event code: 3005
Event message: An unhandled exception has occurred.
Event time: 8/5/2014 9:38:10 AM
Event time (UTC): 8/5/2014 4:38:10 PM
Event ID: 20964fc40f3c43348ccff13e467e259a
Event sequence: 7
Event occurrence: 1
Event detail code: 0
Application information:
    Application domain: /LM/W3SVC/1/ROOT/AppAdvisor-1-130517302775480349
    Trust level: Full
    Application Virtual Path: /AppAdvisor
    Application Path: C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\WebConsole\AppDiagnostics\AppAdvisor\Web\
    Machine name: SCOM-MS01
Process information:
    Process ID: 4332
    Process name: w3wp.exe
    Account name: NT AUTHORITY\NETWORK SERVICE
Exception information:
    Exception type: WebException
    Exception message: The request failed with HTTP status 401: Unauthorized.
Request information:
    Request URL: http://scom-ms01/AppAdvisor/Pages/ReportService/ReportServicePageImpl.aspx?_r=&_c=g&_pg=436ac5a4-3e70-41b9-9fe1-5a5c96724dc0&_s=2C369460
    Request path: /AppAdvisor/Pages/ReportService/ReportServicePageImpl.aspx
    User host address:
    User:
    Is authenticated: True
    Authentication Type: Forms
    Thread account name: NT AUTHORITY\NETWORK SERVICE
Thread information:
    Thread ID: 17
    Thread account name: NT AUTHORITY\NETWORK SERVICE
    Is impersonating: False
Similarly, I received an error when connecting to the AppDiagnostics site as well:
Event code: 3005
Event message: An unhandled exception has occurred.
Event time: 8/5/2014 9:32:02 AM
Event time (UTC): 8/5/2014 4:32:02 PM
Event ID: 67e2d2ba9c4842c3bc041c62bad932e3
Event sequence: 8
Event occurrence: 1
Event detail code: 0
Application information:
    Application domain: /LM/W3SVC/1/ROOT/AppDiagnostics-2-130517299136496487
    Trust level: Full
    Application Virtual Path: /AppDiagnostics
    Application Path: C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\WebConsole\AppDiagnostics\Web\
    Machine name: SCOM-MS01
Process information:
    Process ID: 8048
    Process name: w3wp.exe
    Account name: IIS APPPOOL\OperationsManagerAppMonitoring
Exception information:
    Exception type: OleDbCommandException
    Exception message: Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.
    Command text: Select CONFIGID, CONFIGNAME, CONFIGVALUE From apm.CONFIG
    Connection: Provider=SQLOLEDB;Server=scom-sql;database=OperationsManager;Integrated Security=SSPI;
Request information:
    Request URL: http://scom-ms01/AppDiagnostics/Pages/Authenticate.aspx?ReturnUrl=/appdiagnostics
    Request path: /AppDiagnostics/Pages/Authenticate.aspx
    User host address:
    User:
    Is authenticated: False
    Authentication Type:
    Thread account name: IIS APPPOOL\OperationsManagerAppMonitoring
Thread information:
    Thread ID: 9
    Thread account name: IIS APPPOOL\OperationsManagerAppMonitoring
    Is impersonating: False
And finally, on the primary /OperationsManager web console, I'd receive an authentication error. The client would be prompted multiple times for a username and password and eventually bomb out.
Solving the problem.
The first step was a prerequisite for both the AppAdvisor and AppDiagnostics issues.
- Open the IIS console on the web console server
- Select "Application Pools"
- Select "OperationsManagerAppMonitoring"
- If you are receiving the errors and the application pool "Identity" is set to "ApplicationPoolIdentity", highlight the OperationsManagerAppMonitoring pool and select the "Advanced Settings" option in the action pane.
- Under "Process Model", change the Identity from ApplicationPoolIdentity to "NetworkService"
- Run an IISReset at an administrator (elevated) command prompt
- Open the IIS console on the web console server
- Select and expand the site (Default Web Site on my server) where the Operations Manager web console is installed.
- Select the virtual directory named "AppAdvisor"
- Open the "Authentication" applet
- If not already enabled, enable the "Anonymous Authentication" and "ASP.NET Impersonation" methods
- Run an IISReset at an administrator (elevated) command prompt
- Open "Internet Options" in Internet Explorer
- Select the "Advanced" tab
- Scroll almost all the way down and uncheck the box for "Enable Integrated Windows Authentication"
Friday, February 7, 2014
SCOM 2012 Failed Accessing Windows Event Log with Veeam Management Pack
Noticed during a routine health check that our two Management Servers were showing a warning state. Error read as "Failed Access Windows Event Log" <management server 1> (Health Service).
Error details show the following:
The Windows Event Log Provider is still unable to open the Veeam Collector event log on computer 'management server 1'. The Provider has been unable to open the Veeam Collector event log for 720 seconds. Most recent error details: The specified channel could not be found. Check channel configuration. One or more workflows were affected by this. Workflow name: many Instance name: many Instance ID: many Management group:
We have the Veeam management pack for SCOM loaded and sure enough, this appears to be a documented issue on the Veeam knowledge base.
http://www.veeam.com/kb1496#/kb1496
Thursday, November 21, 2013
Windows 2012 WMI Hotfix
Had a 2012 server that was being monitored by System Center lock up on us today. Suspect a WMI leak. Hotfix deployment, engage!
http://support.microsoft.com/kb/2790831/en-us
Friday, November 15, 2013
SCOM 2012 Powershell - Retrieving a List of Computers in a Group
Had to search for a batch file that is on one of the many SQL servers we have in the environment. First inclination was, let me pull the systems from SCOM since it has all our SQL servers.
Poked around the interwebs a while and noticed a lot of scripts had references to 2007 commands that hadn't been updated to 2012. Here are the basic steps I took to get my group of SQL servers. You could perform the same task on pretty much any group in the same manner.
- Open the Operations Manager Shell PowerShell console
- Type in: Get-SCOMGroup
- Search for the group you want to retrieve members from
- Now type in: $Group = Get-SCOMGroup | where {$_.DisplayName -eq "SQL Computers"} (or substitute the name of the group you're looking for in place of "SQL Computers")
- Next, type in: $Members = $Group.GetRelatedMonitoringObjects()
- Now, you can simply type: $Members
- Or, pipe the command out to a file: $Members | Sort DisplayName | FT DisplayName | out-file C:\Scripts\Servers.txt