This is the third installment of a series on our deployment of System Center at my workplace, emphasis on SQL Server MP.
At this point we’ve got Operations Manager installed, and up and running, and we’ve been able to categorize all the monitored servers into production, preproduction, test and DR using groups that have dynamic membership rules. We’ve got the SQL management pack working with out-of-the-box settings, and used it to locate all the SQL Server stack services like the engine, reporting services, integration services, etc. But the out-of-the-box settings don’t work quite exactly for our environment, though they are pretty good. Here are some things we had to add, and you might want to consider for your SCOM config.
Alerts on Job Failure / Long Jobs
The first thing we noticed when watching the SQL MP’s behavior is that SQL Agent job failures don’t alert very effectively. Discovery of specific jobs, and then monitoring of their last run status, may be disabled out-of-the-box. When turned on, they only show as “warning” and not “critical.” This is apparently by design, in order to cut down on the noise in environments with a lot of failing jobs. In place of the monitoring for individual jobs is a generic “some job failed” monitor at the level of the whole SQL Agent for the instance, but it doesn’t detail which job.
In our group, we really need to know if a specific production job fails, so one of the first steps was to create an override for this to make job failures fire a critical alert, so that they would trigger our email and text notifications. This is a pretty straightforward override that is documented on the web – basically you just have to override the severity of the job failure and configure it to trigger an alert. Second, we disabled the default monitor that alerts generically on any job failure, since those would essentially be duplicate alerts.
So, if you are headed down this road:
- Check to see the overall “health” of the SQL Agent jobs in your environment first, to determine if it’s practical to monitor them individually
- If so, it works to substitute monitoring of each job for monitoring only for generic “some job failed” events, by overriding these two monitors
Next, there’s an monitor for long-running jobs. This is probably appropriate for many jobs, but there are some SQL Agent jobs that are designed basically to run forever, and will trigger false alarms. Biztalk, for example, has some maintenance jobs that just fire up and then execute in a loop indefinitely. We had to go through and cherry-pick each of these jobs to override/suppress the long running job monitor.
Missing Perf Counter Collection
For whatever reason, probably because they can’t please everybody all the time, some perf counters that I rely on are not configured in the default management pack, and you might consider adding these as we did.
The first is the Memory Grants Pending counter in SQL Server. This counter basically indicates a specific type of memory pressure where queries are stacking up in a queue waiting for execution memory to become available; it typically happens in a data warehouse type of workload with huge, nasty queries that consume large amounts of execution memory. I happen to have a server that does this fairly often.
We added a custom monitor into the Performance health rollup category for this counter, and set it up as off by default. We then created overrides to enable it on the data warehouse server(s) where this type of thing is an issue:
Setting up a monitor like this can be done simply with a custom monitor creation wizard in the SCOM console GUI – the tricky bit is getting the variables into the dialog box (in this case the Object: field) that allow SCOM to handle the multiple-instance nature of SQL Server correctly. That is, there could be counters for instance1 and instance2 and instance3 on the same server, and the monitoring has to account for that. It manages that by allowing you to put variables into the setup that will be substituted for things like instance name at run time. This article was very helpful on instance-aware monitors: http://blogs.msdn.com/b/boris_yanushpolsky/archive/2007/11/21/opsmgr-sp1-creating-rules-and-monitors-for-multi-instance-components.aspx
Essentially, if you are going to be authoring monitors or rules for anything instance-able (SQL Server services, databases, etc.) then you really have to “grok” how to use the SCOM variables in the correct fields in the monitor or rule setup. It looks complicated, but I found I got the logic once I used it a couple times.
The second is the Page Life Expectancy counter. This is less for alerting, and more for tracking overall instance health and looking for chronic memory pressure.
Database Size. Definitely.
Paul Randal posted some great guidelines about how to monitor for mirroring performance and issues, like http://www.sqlskills.com/BLOGS/PAUL/post/Importance-of-monitoring-a-database-mirroring-session.aspx. If you use database mirroring, it’s logical to want to both gather stats about how it’s performing and be alerted to issues. We’re at the very beginning of that setup, but have at least got rules configured to monitor for the send and redo queue length on our mirrored databases. There’s also a solution you can download and implement here: http://rburri.wordpress.com/2010/09/10/sql-server-db-mirroring-mp-update/
Disable Database Space Monitors on Mount Points
The SQL Server internal functions for reporting OS disk space available don’t work very effectively with mount points, and that issue shows in SCOM’s SQL MP, where the reporting on available disk space per database is just wrong. I’ve already covered this, so I won’t go back over it again. Hopefully this improves in a future version. We have elected to disable the collection of the invalid data, and instead monitor disk space and the space allocated vs. used inside the database files instead. It’s a workaround at best.
Here’s how I implemented a workaround in SCOM:
- I made a group called “SQL Servers that use Mount Points.” No joke. I put in a static list of all the machines I know are configured this way.
- I set overrides on the collection of database free space and log free space to disable them for that group.
- We ensured the DBAs would be alerted on low disk space conditions from the Windows Server components, and we just have to sort of live with that instead.
That’s it for this installment. Next up, monitoring for blocking, which is a much more complicated animal. Happy monitoring!