There are situations where guaranteed accuracy correctness is essential,
which is why we have transactional databases meeting the ACID properties. And
then there are situations where we just need an approximate number. An example is the size of the
SQL Agent job history log.
I bring up this topic because on more than one occasion,
after fixing performance problems at a client site, the remaining top waits are
two SQL statements belonging to stored procedures in msdb: one in
sp_sqlagent_log_jobhistory,
and one in sp_jobhistory_row_limiter.
The average wait time is on for the day of Feb 28 was 207ms
and 201ms respectively. The average elapsed times over several weeks are 171ms
and 170ms respectively. Neither of these show meaningful CPU consumption,
at 0.716ms and 0.124ms respectively.
A third statement, in
sp_jobhistory_row_limiter,
actually has much higher CPU consumption. This statement contributes 33.5ms CPU and 39.4ms elapsed time
per call over the long period.
In terms of overall load on the system, these msdb
procedures are negligible. A modern 2-socket system has 12-16 physical cores,
so there are 288-384 CPU-hours available per day. The CPU intensive query
consumes 40 min per day (33.5ms/call x 71,509 calls per day), a very small
portion of the available CPU.
The wait time of the 2 longer running queries may be
significant at 4 hours per day each. The calls are from SQL Agent, not a live
person waiting for the response, so the wait time is not a problem by itself.
The high wait statement in sp_sqlagent_log_jobhistory is:
INSERTINTO
msdb.dbo.sysjobhistory
(
job_id,
step_id,
step_name,
sql_message_id,
sql_severity,
message,
run_status,
run_date,
run_time,
run_duration,
operator_id_emailed,
operator_id_netsent,
operator_id_paged,
retries_attempted,
server)
VALUES (
@job_id,
@step_id,
@step_name,
@sql_message_id,
@sql_severity,
@message,
@run_status,
@run_date,
@run_time,
@run_duration,
@operator_id_emailed,
@operator_id_netsent,
@operator_id_paged,
@retries_attempted,
@server)
GO
The high wait statement in sp_jobhistory_row_limiter is:
SELECT
@current_rows_per_job =
COUNT(*)
FROM
msdb.dbo.sysjobhistory
with
(TABLOCKX)
WHERE
(job_id
=
@job_id)
The maximum elapsed times of sp_sqlagent_log_jobhistory and
sp_jobhistory_row_limiter
over the longer period are 11 and 41 seconds respectively. Potentially this
might cause problems in situation that require precise timing. Of course, it is
probably not a good idea to rely on SQL Agent if precise timing is required on
a heavily load database server processing a high volume of jobs.
One might think that at just less than 1 call per second (71,509
calls per day, 86,400 sec per day) the chance of a collision between the two statements
at 0.2 sec each is small. However these are not random calls, but fire on
schedule. So if 100 jobs fire off every 2 minutes on the minute, perhaps the
chance of collision is much higher?
The sp_sqlagent_log_jobhistory
procedure inserts into jobhistory, and then at the end calls sp_jobhistory_row_limiter.
There will not be contention between the two procedures for a given job and session.
So the issue is whether there will be collisions between different jobs. (I am
not sure under what circumstances there would be multiple sessions of 1 job.
Does this occur if the job does not complete within its periodicity or can one
job be controlled by multiple schedulers?)
The typical sequence of statements in sp_jobhistory_row_limiter excluding
conditionals is as follows:
EXECUTE
master.dbo.xp_instance_regread
-- 'JobHistoryMaxRows'
EXECUTE
master.dbo.xp_instance_regread
-- 'JobHistoryMaxRowsPerJob'
BEGIN TRANSACTION
/* this is the high wait statement */
SELECT
@current_rows_per_job =
COUNT(*)
FROM
msdb.dbo.sysjobhistory
with
(TABLOCKX)
WHERE
(job_id
=
@job_id)
WITH
RowsToDelete AS
(
SELECT
TOP
(@rows_to_delete)
*
FROM
msdb.dbo.sysjobhistory
WHERE
(job_id
=
@job_id)
ORDER
BY
instance_id
)
DELETE FROM
RowsToDelete;
/* this is the high CPU statement */
SELECT
@current_rows
=
COUNT(*)
FROM
msdb.dbo.sysjobhistory
/* this is rarely called? */
WITH
RowsToDelete
AS
(
SELECT
TOP (@rows_to_delete)
*
FROM
msdb.dbo.sysjobhistory
ORDER
BY
instance_id
)
DELETE
FROM RowsToDelete;
IF
(@@trancount
> 0)
COMMIT
TRANSACTION
The combination of the SELECT with
(TABLOCKX) and the BEGIN/COMMIT TRAN will block inserts to jobhistory for the
duration of the transaction. So it is quite possible there are collisions with
the sp_sqlagent_log_jobhistory
insert statement.
Below is the detail for the
xp_instance_regread call. For some reason the execution statistics are not available. If this is mainly
called from the sp_jobhistory_row_limiter
procedure, with 2 calls, then there would be 143,000 calls per day with an
average wait of 12.6ms per call.
I am thinking that the reason
for the BEGIN/COMMIT TRAN block in the
sp_jobhistory_row_limiter
procedure
is to prevent the action of one session from deleting rows in the process of
being inserted by another session?
Having limits on both the
total number of rows for job history and individual jobs is a good idea. But
this is an example of a situation where we are not concern about guaranteed
accuracy. In this case there are 150 jobs and 260,000 rows in jobhistory,
requiring 460MB of storage.
If we set total jobhistory at 500,000 rows and individual job history at 1000 rows,
perhaps our intention
We definitely do not need to check and enforce (both) jobhistory
limits 71,500 times per day. One way to reduce the frequency of calls is to
delete. I suggest deleting at minimum 10 rows per job everytime the threshold
is exceeded.
I think it is silly to do a
full count of the number of rows in sysjobhistory for job execution. The SQL
Server SELECT COUNT(*) statement does not cheat. It reports the actual number
of rows in the table by touching the row header of each and every row. So
perhaps we should use an alternative good enough row count, from the
sys.dm_db_partition_stats
row_count column for example?
I would also like to do
away with the BEGIN/COMMIT TRAN block along with the TABLOCKX. Perhaps we can
only delete jobs more than 1 day old ?
/* be sure to make this index change before altering the procedure */
CREATE INDEX nc1
ON sysjobhistory
(job_id)
INCLUDE
(run_date)
WITH
(DROP_EXISTING
=
ON)
GO
So my proposal for the sequence of
statements in jobhistory_row_limter is as follows:
SELECT @today =
CONVERT(
int,
CONVERT(char
(8),
GETDATE(),112))
--BEGIN TRAN
SELECT
@current_rows_per_job
=
COUNT(*)
FROM
msdb.dbo.sysjobhistory
-- with (TABLOCKX)
WHERE
(job_id
=
@job_id)
AND run_date
< @today
SELECT
@rows_to_delete
=
@current_rows_per_job
-
@max_rows_per_job
IF
(
@rows_to_delete
> 0
)
BEGIN
/* if we only delete previous days,
then this will only active once per day */
WITH
RowsToDelete
AS
(
SELECT
TOP
(@rows_to_delete)
*
FROM
msdb.dbo.sysjobhistory
WHERE
(job_id
=
@job_id)
AND run_date
< @today
ORDER
BY
instance_id
)
DELETE
FROM
RowsToDelete;
END
/* do this first */
SELECT @current_rows
=
row_count
FROM
sys.dm_db_partition_stats
d
WITH(NOLOCK)
WHERE object_id
=
OBJECT_ID('sysjobhistory')
AND
index_id < 2
/* only if the above is above threshold */
IF
(@current_rows
>
@max_total_rows)
BEGIN
SELECT @current_rows
=
COUNT(*)
FROM
msdb.dbo.sysjobhistory
with
(NOLOCK)
WHERE
run_date <
@today
SELECT
@rows_to_delete
=
@current_rows
-
@max_total_rows
IF
(
@rows_to_delete
> 0
)
BEGIN
WITH RowsToDelete
AS (
SELECT TOP
(@rows_to_delete)
*
FROM
msdb.dbo.sysjobhistory
WHERE run_date
< @today
ORDER BY
instance_id
)
DELETE FROM
RowsToDelete;
-- ALTER INDEX ALL ON msdb.dbo.sysjobhistory REBUILD?
END
END
--COMMIT TRAN
Even though the overall impact of
the jobhistory limits procedures are insignificant, it is somewhat annoying to
have them show up at the top of the list of a third party performance
monitoring tool. With a few simple precautions, these can be eliminated.
ps
If any one is willing to alter the jobhistory row limiter procedure,
please us know the changes made along with the before and after execution statistics.
I should also add that the sysjobhistory table will get heavily fragmented over time, especially since job_id is a uniqueidentifier, so perhaps periodic index rebuilds are in order. Perhaps our strategy should be: when the max total rows is exceeded, delete a larger percentage, say 20% for example, then rebuild the index (outside the transaction if present),
pps - I may have inadvertently deleted a comment? - oops! - sorry