OK, we’re way over 90 days into our SCOM implementation, but I picked a title theme, and now we’re kind of stuck with it. In any case, today is a short and to-the-point post about how to get Agent jobs alerting on failure through Operations Manager. Others, including Thomas LaRock, have covered much of this before, so I will try not to beat a dead horse here.
Out of the box, the SQL Server Management Pack doesn’t have SQL Agent job discovery or alerts turned on. And every DBA I know wants to know when SQL Agent jobs fail, unless they work in an environment so catastrophically awful that jobs fail constantly, or by design, and are simply ignored. Personally, I want to know about every failure, and then maybe exclude specific problem jobs or servers using overrides.
So here’s the skinny, to get to alerting on individual job failures:
- You must override the discovery of SQL Agent jobs so that they are picked up by SCOM at all when you first configure the management pack. See http://skaraaslan.blogspot.com/2011/08/how-to-monitor-sql-agent-jobs-using-sql.html or Google for how to do that. Otherwise, individual jobs are ignored, and only the state of the SQL Agent service itself is discovered and reported.
- Once the jobs are discovered, it’s tempting to look at the Last Run Status monitor as your savior for catching failed jobs. The problem is, that monitor stinks in practice. We tried it in a large environment, and it caught failures only about half the time. It makes a half-hearted check every so often, and seems oblivious to details like whether a job is enabled or disabled. Not granular enough.
- You’ll want to enable the rule SQL 200x Agent > An SQL Job Failed to Complete Successfully, which is disabled by default. This will ring a bell on each and every job failure, which is what I want. It might be very noisy at first, so be careful when you switch it on.
Lastly, it’s helpful to set all your jobs to write to the event log on failure. I know what you may be thinking: “How am I going to turn that setting on in all the jobs I have across the enterprise?” Answer: a multi-instance query and some dynamic SQL. The basic query we want to run is fairly simple:
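For a single job, the call looks something like this (the job name here is a made-up placeholder; @notify_level_eventlog = 2 is the setting that means “log an event on failure”):

```sql
-- Hypothetical job name; substitute one of your own.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Nightly Backup',
    @notify_level_eventlog = 2;  -- 2 = write to the event log on failure
```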
How do I know where to find that code? I cheat: I open the associated dialog box in SSMS, against a test server, make the change for one object, and script out the change instead of applying it. I then hit BOL Search with the code from the resulting script, to be sure I know how the script works and understand what’s going on. The quality of SSMS-generated code is uneven, so it’s very important to understand how things actually work, but this is a simple starting point.
Next, we need a script that will run that stored proc for all SQL Agent jobs. A simple way to do that is just to select from a system table or view that has all the required objects, and use the result to compose a series of statements, one for each object. If this gives us the list of Agent job names:
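Something like this gives us that list (the jobs live in the sysjobs table in msdb):

```sql
-- One row per SQL Agent job on this instance.
SELECT [name]
FROM msdb.dbo.sysjobs;
```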
Then this combination will compose a series of sp calls and execute them:
DECLARE @sqlcmd nvarchar(MAX);
SET @sqlcmd = '';

SELECT @sqlcmd = @sqlcmd
    + 'EXEC msdb.dbo.sp_update_job @job_name=N'''
    + REPLACE([name], '''', '''''')
    + ''', @notify_level_eventlog=2; '
FROM msdb.dbo.sysjobs;

-- BE VERY CAREFUL WITH THIS:
-- (a PRINT @sqlcmd first, to review the generated statements, is a good idea)
EXEC( @sqlcmd );
Note the crazy number of escaped single quotation marks. This is common in dynamic SQL: some method is required to handle quotation marks that might appear in the names of the objects being manipulated, and here REPLACE doubles each one up. The code is ugly, and would not be happy production code for a real application, but for a basic administrative task it’s quite serviceable.
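To see what that REPLACE is doing, here’s a quick illustration with a made-up job name that contains quotation marks:

```sql
-- The input string is: Nightly 'Full' Backup
-- REPLACE doubles each single quote so the composed
-- EXEC string stays syntactically valid.
SELECT REPLACE(N'Nightly ''Full'' Backup', '''', '''''');
-- Returns: Nightly ''Full'' Backup
```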
With this combination, you should have events in the log for each job failure, and a working rule in SCOM to collect that information and alert as appropriate.