Once you get past the basic architecture of a SCOM implementation, and build the servers, and so on, the first real problem is … well, noise. Suddenly (depending on how you deploy) the system will reach out, like marching army ants or some very clever cybernetic spider, find every problem you didn’t know you had on every server, and then proceed to yell at you about each one. That, of course, is the point. Still, a tool like this is not useful if it doesn’t separate the real problems from the background noise. And that, it turns out, is an implementation problem and not a part of the tool itself. Left to its own devices, SCOM is quite likely to generate hundreds or even thousands of alerts in a new deployment, and those have to be tamed in some fashion.
Categorizing Servers into Production, DR, Test, and Dev to Manage Noise
Surprisingly, perhaps, there’s no out-of-the-box method or even convention in SCOM to differentiate between production, disaster recovery, test and development servers. It would be difficult for Microsoft to predict how people manage this in individual datacenters: one place might have separate networks or IP ranges, others might use server naming, separate domains, and so on, so there isn’t really any obvious way to implement this within the product. Fortunately, SCOM has outstanding dynamic grouping functionality. The first activity I had to learn and then design around was how to create groups of objects that would dynamically organize monitored servers into meaningful collections, to distinguish production servers from the other classes.
Our (excellent*) consultant’s recommendation was to create a standardized registry key for all our servers to identify which category each machine belongs in, and then a custom “discovery” in SCOM to collect that data. We could then label each existing or new server by environment using the registry key, and SCOM would then automatically categorize them with the dynamic grouping functionality. Once implemented, like magic, all the monitored server objects in SCOM move into the appropriate groups. Any new server, if built with the reg key, automatically lands in the right group.
The SQL Server management pack also has a wonderful built-in group, SQL Computers. Any server on which SCOM discovers installed SQL components automatically becomes a member of the SQL Computers group. For SQL Server and our DBA team, I then made more specific, dynamic SQL Server groups that are populated with membership rules like:
Prod SQL Servers: [machines in the SQL Computers group] AND [our reg key = “Production”]
Test SQL Servers: [machines in the SQL Computers group] AND [our reg key = “Test”]
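The logic of those two rules is just "is a SQL machine" AND "has the matching environment tag." As a rough illustration only, here is that logic modeled in Python. This is not SCOM management pack XML, and the `Server` fields, the `members` function, and the sample machine names are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Server:
    """Hypothetical stand-in for a monitored computer object."""
    name: str
    has_sql: bool      # stands in for membership in the SQL Computers group
    environment: str   # stands in for the value of our custom registry key


def members(servers: Iterable[Server], environment: str) -> List[str]:
    """Names of machines that are SQL computers AND carry the given tag."""
    return [
        s.name
        for s in servers
        if s.has_sql and s.environment == environment
    ]


# Made-up inventory, for illustration only.
inventory = [
    Server("db01", has_sql=True, environment="Production"),
    Server("db02", has_sql=True, environment="Test"),
    Server("web01", has_sql=False, environment="Production"),
]

prod_sql_servers = members(inventory, "Production")
test_sql_servers = members(inventory, "Test")
```

Note that `web01` is tagged Production but lands in neither list, because it fails the SQL half of the rule. Nothing here is a hand-maintained list: change a tag or install SQL on a machine and the membership changes with it.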
Those groups can then drive the priority and channel for alerts (Production needs to wake someone up at 2am, but Test doesn’t) and also the organization of views and dashboards in the SCOM console UI.
Another implementation would no doubt be different in the details, but the key here is to make these groups an automated, as opposed to static, list of servers. That way, as a new machine is added to the network, or a new SQL Server component is installed, the components automatically get placed into the right monitoring groups and configuration, without intervention from a person.
I also think it was a good idea to keep the server name out of this equation, because a server name that is overloaded with specific details is perhaps not a great idea in the first place, and our existing servers have no such convention in place. (Example: the first two characters mean “x,” the next three mean “y” and so on. That always seems to break two days after you make up the standard.)
Sounds Fabulous, but How?
Implementation challenge: to create a dynamic group membership rule with logic like that listed above, it turned out we could not use the SCOM GUI for the group definition. Instead we had to export our SQL customization Management Pack to XML and edit the rule as text in the XML file. Fortunately, I found the XML to be pretty clear, and there’s good information online from SCOM experts that even I could follow, enough to write simple membership rules like the Prod and Test SQL Servers rules above.
It took a little research and some trial and error, but we have dynamic groups of all SQL Servers separated into production, DR, preprod, test and dev collections, and new servers automatically dropping into the right group when built. Win! From that core bit of configuration, we can prioritize alerts and customize views per environment. Test and dev are still noisy, but don’t interfere with information about production servers. I think if you are headed down the SCOM road, this is a must, even if it’s hard the first time through. It’s a major advantage once set up.
Which OS counters or alerts do the DBAs need / want to see?
In our organization it’s important that the DBAs know about server basics like OS and hardware health. I get the sense that each organization is different, so this would be a policy decision that one would face early in a SCOM implementation and would directly inform how the system is configured. In some places, the DBA’s role might be more confined to SQL Server proper, with some other group handling basic OS issues.
We started with the SQL Computers dynamic group described above, which ships in the SCOM SQL Management Pack. It lets the DBAs see and receive alerts about the state of any Windows Server OS that has a SQL Server component (Engine, Reporting Services, Analysis Services, etc.) installed. The basic idea is to notify a DBA of any OS issue on a machine that has SQL Server components, but not on other machines. That way if a disk fills up, or CPU use is too high, or what have you, on a SQL Server, we’ll know about it in the DBA group immediately.
Then, the dynamic grouping above is further structured so that we have groups of those SQL Computers (as opposed to groups of SQL Instances or SQL Databases) for categories like production or test. That gets the DBAs information about the OS monitoring, but limits it to the machines that have some SQL Server component and subdivides it by environment.
Next, our alerts, views and notifications are set to key off of those groups, but also configured to fire notifications to the DBA group for any class of object from the operating system on down through things like the cluster service, SQL services, SQL databases and so on. In this way, the noise is eliminated from our precious sleep time, but the critical production issues get through.
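To make that routing idea concrete, here is a small Python sketch of the decision. It is a model of the policy, not SCOM subscription configuration: the channel name, the `Alert` fields, the `route_alert` function, and the "critical only" severity filter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from group to notification channel.
# None means the alert stays in the console and notifies no one.
CHANNEL_BY_GROUP = {
    "Prod SQL Servers": "page-on-call-dba",
    "Test SQL Servers": None,
}


@dataclass(frozen=True)
class Alert:
    """Hypothetical stand-in for an alert raised on a monitored object."""
    object_class: str  # e.g. logical disk, cluster service, SQL database
    group: str         # group of the machine hosting the object
    severity: str      # assumed values: "Critical" or "Warning"


def route_alert(alert: Alert) -> Optional[str]:
    """Return the channel to notify, or None if nobody is notified."""
    if alert.severity != "Critical":
        return None  # assumption: only critical alerts notify anyone
    # object_class is deliberately not checked: any class of object on a
    # machine in the group is routed, from the OS on down.
    return CHANNEL_BY_GROUP.get(alert.group)


disk_full = Alert("Logical Disk", "Prod SQL Servers", "Critical")
test_db_down = Alert("SQL Database", "Test SQL Servers", "Critical")
```

With this mapping, `route_alert(disk_full)` returns the paging channel and `route_alert(test_db_down)` returns `None`, which is the "wake someone up for Production, but not for Test" behavior described earlier.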
* I am happy to refer this company or individual. Contact me at merrillaldrich (a) gmail (.) com.
Up Next: Vital Custom SQL Counters and how to add them to SCOM.