A comment on my blog post The benefits of attending PASS realized! asked for a follow-up on the results of investigating this problem. In that post, I mentioned that a high number of unused connections in oversized connection pools were causing the problem. Unfortunately, that information was incorrect, and thanks to some emails with Bob Ward, I now understand the problem better and have been able to reproduce the issue on a VM on my laptop by changing the 'max worker threads' sp_configure option to its minimum value of 128.
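For reference, here is a sketch of the commands used to set up the repro on the test VM. This drops the worker thread limit far below any sensible production value, so it is for test systems only:

```sql
-- 'max worker threads' is an advanced option, so expose it first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Drop the worker thread limit to its minimum value of 128 to make
-- thread pool exhaustion easy to reproduce under concurrent load
EXEC sp_configure 'max worker threads', 128;
RECONFIGURE;
```

The new value does not take effect until the instance is restarted.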
So what really was the root of the problem? Along with MOSS 2007, two other major applications had databases on this SQL Server. One of the applications has two connection pools to SQL Server, each with one hundred connections, and when it activates, the volume of concurrent requests to the SQL Server was in the range of 150 – 190 requests at the same time. The other application has a single connection pool and generally runs in the range of 20 concurrent connections to SQL Server. The database for this application has a 353,000,000 row logging table that is roughly 62GB in size. This table is rarely used beyond inserting new log records and purging records that are older than 45 days. It exists for troubleshooting problems associated with data transmission from our stores to our home office. However, when it is queried, the queries execute in parallel, and the resulting query plan would use 6 to 8 workers, depending on the request being made. The root of the problem is that the volume of concurrent requests to SQL Server exceeded the number of available worker threads at the point that the problem occurred.
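To see how close an instance is running to its worker thread limit, the scheduler DMVs can be queried directly. A minimal sketch, assuming SQL Server 2005 or later:

```sql
-- Configured worker thread limit for this instance
SELECT max_workers_count
FROM sys.dm_os_sys_info;

-- Current number of workers, broken down by state
SELECT state, COUNT(*) AS worker_count
FROM sys.dm_os_workers
GROUP BY state;
```

When the total worker count is pinned at max_workers_count and new requests keep arriving, those requests queue waiting for a worker, which is exactly the condition described above.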
To track this problem proactively, a couple of performance counters can be monitored. Previously I posted that failed logins, coupled with high wait times for the THREADPOOL wait type, can be used to troubleshoot the problem. However, another method exists to investigate it. Whenever a process waits for a worker thread to execute, a set of performance counters is updated to reflect this state. The "SQLServer:Wait Statistics\Wait for the worker" performance counters can be collected over time to see when and how often the problem occurs.
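Both signals are also exposed through DMVs, so they can be checked without setting up a Perfmon collection. A sketch (the counter instance names can vary slightly by version):

```sql
-- Cumulative THREADPOOL waits since the instance last started
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';

-- The "Wait for the worker" counters, as seen through the DMV
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Wait for the worker';
```

Sampling these on a schedule and diffing the values gives the same over-time picture as collecting the performance counters.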
To fix the issue, we've moved the MOSS databases onto a dedicated SQL Server to separate the workloads, which has resolved the problem. Essentially, the size of the SQL Server isn't sufficient for the volume of concurrent work when all three of the applications are under load together. We can actually pinpoint when this problem would have started in our environment from our deployment tracking software. Over the summer, changes were made in the MOSS farm to add a second web server behind our load balancer, which increased the amount of concurrent work beyond the tipping point for the SQL Server.
Once again, I want to thank Bob Ward for his assistance with this as well as for reviewing the contents of this post for accuracy before I posted it.