Michael Zilberstein

For want of a nail

For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail.

 

I'm now in the middle of a severity A case that is best depicted by this rhyme. While Microsoft engineers look for the root cause, I'll describe here how a small bug can "kill" a strong server. The story begins when, for some unknown reason, the "rows" column in the sys.partitions catalog view began to show 4.6 billion rows for the clustered index of one of the entity tables in my data warehouse database. COUNT(*) on the same table returns 1 million rows.
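
A quick way to see such a discrepancy is to compare the metadata rowcount with an actual count. A minimal sketch, assuming the table is called dbo.FactEntity (a hypothetical name - the real one doesn't matter):

-- Metadata rowcount: what the optimizer believes
SELECT SUM(p.rows) AS metadata_rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.FactEntity')
  AND p.index_id IN (0, 1)   -- heap or clustered index

-- Actual rowcount: full scan of the table
SELECT COUNT(*) AS actual_rows
FROM dbo.FactEntity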

Second step - a daily job that executes the sp_updatestats stored procedure. This procedure updates statistics in the database using sampled data. Let's say I randomly sample 5% of the table's data and count the number of unique values of Column1 - call it n. Extrapolating, the estimated total number of unique values of Column1 in my table is 20*n. But how does SQL Server know how many rows there are per unique value of Column1? Very simple: it takes the rows column from sys.partitions and divides it by our 20*n number.
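
Both inputs to that estimate are visible directly. A sketch, assuming the statistic on Column1 is named IX_FactEntity_Column1 (hypothetical, as is the table name):

-- "All density" in the density vector is 1 / (estimated distinct values), i.e. 1/(20*n)
DBCC SHOW_STATISTICS ('dbo.FactEntity', 'IX_FactEntity_Column1') WITH DENSITY_VECTOR

-- The optimizer's estimate for "WHERE Column1 = x" is roughly:
--   rows (from sys.partitions) * density = 4.6 billion / (20*n)
-- instead of 1 million / (20*n) - about 4,600 times too high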

Third step - a query: it joins our table with another one, with WHERE Column1 = x. The Optimizer compiles the query and builds an execution plan based on the existing statistics (i.e. 4.6 billion rows in the table). The expected number of rows from our table is in the millions - so the Optimizer chooses a HASH JOIN as the best way to perform the JOIN operation.
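
In sketch form (all object names hypothetical), the query looked something like this; with the true rowcount the Optimizer would more likely have picked a nested loops join:

SELECT f.*, d.*
FROM dbo.FactEntity AS f
JOIN dbo.OtherTable AS d
    ON d.EntityId = f.EntityId
WHERE f.Column1 = 42   -- the "= x" predicate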

Now that the Optimizer has made a decision about the execution plan, it asks for a memory quota for query execution - based on the estimate of millions of rows. Remember that the memory quota for query execution isn't taken from the data buffer cache, so on 32-bit systems we can't utilize much memory for it. For example, I executed a single query as described earlier and, while it was running, checked the memory usage of my session.

[Screenshot: memory usage of the session while the query was running - about 252 MB granted, less than 1 MB used]

So we have 252 MB granted while less than 1 MB is used. The total memory available for grants is 1.6 GB, so you can calculate for yourself how many concurrent queries the system supports in its current state. That is exactly the situation we had: 5 sessions were running and 65 were waiting for memory grants (if you monitor wait types, you'll see a lot of RESOURCE_SEMAPHORE).
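
Those numbers come from the memory grants DMV (available since SQL Server 2005). A sketch of the check:

-- One row per query that has, or is waiting for, a workspace memory grant
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       grant_time,        -- NULL while the session is still queued
       wait_time_ms
FROM sys.dm_exec_query_memory_grants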

P.S. DBCC UPDATEUSAGE fixes the initial problem. So while Microsoft engineers look for the reason why sys.partitions went wild, I monitor it in order to fix the problem before it "kills" the server.
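
The watchdog boils down to something like the following sketch (database and table names are hypothetical, and the 1000-row tolerance is arbitrary):

DECLARE @meta bigint, @actual bigint

SELECT @meta = SUM(p.rows)
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.FactEntity')
  AND p.index_id IN (0, 1)

SELECT @actual = COUNT(*) FROM dbo.FactEntity

-- COUNT_ROWS makes DBCC UPDATEUSAGE recount and correct sys.partitions.rows
IF ABS(@meta - @actual) > 1000
    DBCC UPDATEUSAGE ('MyDW', 'dbo.FactEntity') WITH COUNT_ROWS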

Published Tuesday, December 08, 2009 3:55 AM by Michael Zilberstein


Comments

 

Sankar Reddy said:

Michael,

Did this table/database go through a migration from SQL Server 2000 recently?

December 8, 2009 12:46 AM
 

Michael Zilberstein said:

Hi Sankar,

Thank you for the question - Microsoft support engineers asked me the same one. The answer is no; it was created 3 years ago on SQL 2005.

And we also don't have any DDL operations besides statistics updates running on this particular table (another question the Microsoft support engineers asked).

December 8, 2009 2:29 AM
 

LeoPasta said:

I had similar problems with heap tables in the past (on SQL 2000 though). It probably isn't the same issue, but if it is, creating a clustered index and dropping it afterwards solves it (although I would rather leave the index permanently).

December 8, 2009 5:56 AM
 

Greg Linwood said:

Why are you running a data warehouse with millions of rows on only 1.6 GB of available memory?

December 8, 2009 6:21 AM
 

Michael Zilberstein said:

Leo, that was the first thing the Microsoft support engineers asked - whether this database was originally created on SQL 2000 and upgraded to 2005. That's not my case though - the database was created on SQL 2005, and the index that showed 4 billion rows was the clustered one.

December 8, 2009 7:46 AM
 

Michael Zilberstein said:

Greg,

That's a 32-bit server with 16 GB of RAM, of which SQL Server is limited to 14 GB according to best practices. The problem is that the memory quota for query execution isn't taken from the data buffer cache. It's not the data itself but the in-memory structures created during query processing (hash tables, lazy/eager spools etc.) that take this space, and it is limited to the leftover virtual address space, i.e. a little less than 2 GB (which it also shares with Windows). This problem doesn't exist on a 64-bit server.
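
For the curious, the non-buffer-pool consumers can be inspected via the memory clerks DMV. A rough sketch using the SQL 2005 column names (merged into a single pages_kb column in SQL 2012):

-- Workspace (grant) memory appears under the MEMORYCLERK_SQLQERESERVATIONS clerk
SELECT type,
       SUM(single_pages_kb) AS single_pages_kb,   -- allocated inside the buffer pool
       SUM(multi_pages_kb)  AS multi_pages_kb     -- allocated outside the buffer pool
FROM sys.dm_os_memory_clerks
GROUP BY type
ORDER BY SUM(multi_pages_kb) DESC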

December 8, 2009 7:53 AM
 

Alejandro Mesa said:

Michael,

Thanks for sharing this experience with us. Please let us know what the culprit was, if Microsoft gets back to you.

AMB

December 8, 2009 10:24 AM
 

Michael Zilberstein said:

Alejandro,

Meanwhile they don't have much to say - I uploaded a backup of the database from that day, so they can see the problem but can't say what caused it. The only interesting finding is that 2 other customers have reported the same behavior on the same SQL Server build (9.00.3161) - they too were unable to reproduce the bug, so those tickets were closed without a solution. I hope we won't succeed in reproducing it either. :-)

December 8, 2009 6:57 PM
 

Kalen Delaney said:

Very interesting post, following on from my post about counting rows at http://sqlblog.com/blogs/kalen_delaney/archive/2009/12/07/how-many-rows.aspx; I usually consider the sys.partitions value 'close enough', but obviously that is not always the case!

December 9, 2009 6:09 AM
 

Michael Zilberstein said:

Kalen,

You were talking about _normal_ behavior in your post. What I have here is definitely a SQL Server bug. See also my answer to Alejandro Mesa earlier - two other customers reported the same behavior on this particular build of SQL Server.

By the way, this is how I found out that SELECT FROM sys.indexes JOIN sys.partitions blocks - my workaround for the bug was a job that checks the rowcount for this particular table every 5 minutes and executes DBCC UPDATEUSAGE when necessary. This job failed a couple of times as a deadlock victim. I specified the deadlock details (the resources in contention) in a comment on your post.

December 9, 2009 6:26 AM
 

Martin Smith said:

BTW, this value can be manually updated:

CREATE TABLE T (X INT)

SELECT rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('T')

UPDATE STATISTICS T WITH ROWCOUNT = 1000000, PAGECOUNT = 100000

SELECT rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('T')

DROP TABLE T

April 18, 2012 5:14 AM
 

Martin Smith said:

April 18, 2012 5:15 AM
 

Michael Zilberstein said:

Martin,

You're right in general. I know about this feature, but I use it only when I want to examine a "what if" scenario - how would the execution plan change if my table grew 100 times? It is definitely not a feature I want to use in a production environment.

April 18, 2012 8:35 AM
 

Len Berkholtz said:

Did Microsoft ever get back to you on the bug?

Len

August 19, 2014 2:57 PM
 

Michael Zilberstein said:

Len,

Microsoft told us that 2 other users had reported the same bug on the same SQL Server version, and that nobody had experienced anything like it on other versions. So since they couldn't find the root cause, and since the bug doesn't seem to exist in other versions, they recommended that we upgrade to the next service pack. We did the upgrade, and indeed the bug never returned.

August 21, 2014 3:19 AM
