Wednesday, May 16, 2012
Monitoring Theory (Murphy's Law of Monitoring)
Yeah, I just created a new monitoring theory. OK, it might not be new, but it is the philosophy behind how I often configure System Center as the monitoring platform of choice. As systems admins, we all hate the late-night alerts that essentially mean nothing. So how do you prevent those alerts from hitting your phone while you are comfortably asleep? How should monitoring be approached? Do you care that once a day a server's CPU spikes and generates a critical alert?

My approach to alerting and monitoring is this: transient problems should be collated over a period of time to see whether there is a long-term trend that needs addressing, while alerts, especially after hours, should consist of site, server, or service down, plus hard disk space issues. Ultimately, are these not the events that will have the corporate director or a customer calling you in the morning to chew your head off? So start your approach there, whether by subscribing specifically to those alerts or by lowering the severity of the others (CPU utilization, disk slowness, etc.).

As for that transient information, create an SLA report with a target somewhere between 90% and 95%. Why? Trying to hold a server to 99% acceptable values for CPU utilization could get expensive and would likely waste resources, since the server is not busy most of the time anyway. What the SLA gives you is a trending value for the alerts you do not necessarily care about when they happen once or twice. If the condition is consistent over a month or a quarter, however, then you may want to look at upgrade planning or optimizing the server. SLA reports also let you look at the health of all monitored devices at once, instead of reacting to ad hoc alerts for transient conditions.
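To make the idea concrete, here is a minimal sketch (in Python, not System Center itself) of the SLA-trending logic described above: count what fraction of samples are "acceptable," compare that to a 90% target, and only flag the server when the long-term trend breaches the target. The 80% CPU threshold, the 90% target, and the sample data are all hypothetical numbers for illustration, not values from the original post.

```python
# Conceptual sketch of SLA-style trending (assumed thresholds, not SCOM API):
# a few transient spikes stay under the radar, but a sustained trend over the
# reporting period is flagged for upgrade planning or optimization.

CPU_THRESHOLD = 80.0   # % CPU above which a sample counts as "unacceptable"
SLA_TARGET = 90.0      # % of samples that must be acceptable over the period

def sla_compliance(samples):
    """Return the percentage of samples at or below the CPU threshold."""
    acceptable = sum(1 for s in samples if s <= CPU_THRESHOLD)
    return 100.0 * acceptable / len(samples)

def needs_attention(samples):
    """True when the period-long trend, not a single spike, breaches the SLA."""
    return sla_compliance(samples) < SLA_TARGET

# Hypothetical month of hourly samples (720 total).
# One short daily spike: the SLA is still met, so no alert fatigue...
occasional_spike = [30.0] * 700 + [95.0] * 20
# ...but a server pinned high for a third of the month is a real trend.
sustained_load = [30.0] * 480 + [95.0] * 240

print(needs_attention(occasional_spike))  # transient spikes: SLA still met
print(needs_attention(sustained_load))    # consistent breach: plan an upgrade
```

In System Center itself this role is played by Service Level Tracking reports rather than hand-rolled code; the sketch just shows why a 90-95% target separates transient noise from the trends worth acting on.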