THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Merrill Aldrich

Visualizing Data File Layout III

This is part three of a blog series illustrating a method to render the file structure of a SQL Server database into a graphic visualization.

Previous Installments:

Part 1

Part 2

Those that have been reading this series might be be thinking, “Is he going to go there?” Well, the answer is “Yes.” This is the GUID clustered index post that had to be. It’s inevitable with this tool.

If you follow SQL Server at all, you are probably aware of the long-standing debate about whether it is wise, desirable, smart, useful, or what have you, to identify rows using GUIDs. I won’t take a position on that, but I will show here, I hope objectively, a few things that the visualizer shows about file layout vs. distributed inserts, distributed inserts being one of the main challenges around using GUIDs as clustering keys. Just to recap the argument very, very briefly:

Advantages

GUID keys can be generated at the client, which saves a round-trip to the database server to create a collection of related rows.

GUID keys can make certain architectures like sharding, or peer to peer replication, or merging multiple source databases, simpler.

Disadvantages

GUID keys are wider, therefore they take more space in memory and on disk. The additional space is multiplied by their presence in both clustered and non-clustered indexes if they are a clustering key.

GUID keys don’t only take more space in RAM and on disk because of their width. They also cause distributed inserts into the clustered index – that is, new rows are added to any and all pages in the index. Each time a row has to be added, the target page must be read into memory, and at a checkpoint, the whole changed page (both existing and new rows) must be written to disk. This has two effects:

  1. The amount of RAM and disk IO required for inserts is probably much higher, as pages with existing data must come into cache, get changed, and then be written back out again. Essentially, large parts of the table have to be rewritten to disk to append rows to pages that have data already.
  2. The pages that store the index will individually fill up, and have to split such that half the existing rows are written back out to the “old” page and half written out to a “new” page in a different location on disk. This causes the pages to be less full, the same number of rows to require more space on disk and in RAM, and the resulting index to be massively fragmented on disk.

I am not writing to argue these points, which have I think been established by both sides of the debate, only to see if the visualizer shows these effects clearly. Most of the argument isn’t actually about these facts (they are all true, as far as I know) but rather which are more important, and I think that is the main source of debate on the issue.

Visual Example of Distributed Inserts

It’s very easy to create an example of this with a small sample database. I created one called “VizDemo2.” VizDemo2 has a slightly modified structure to illustrate what’s going on here – I need two tables that are stored separately on disk, so that they cannot interfere with one another. The simplest way to do that is with a couple of file groups containing one file each. So here’s the structure:

  1. I created the database with a 50MB, single file, Primary file group
  2. I added a file group FG1 with one 75MB file
  3. I added a second file group FG2 with one 75MB file

When the database is empty, the visualizer shows only the system pages at the start of each file, as shown here:

VizDemo2_01

To that database I added two sample tables identical in structure but with different clustering keys:

USE VizDemo2
GO

CREATE TABLE dbo.SampleCustomersInt  ( 
    id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
    buncha char(500) default 'A', 
    big char(500) default 'B', 
    vals char(500)  default 'C'
) ON [FG1];
GO
CREATE TABLE dbo.SampleCustomersGuid  ( 
    id uniqueidentifier NOT NULL PRIMARY KEY CLUSTERED DEFAULT NEWID(), 
    buncha char(500) default 'A', 
    big char(500) default 'B', 
    vals char(500)  default 'C'
)ON [FG2];
GO

I’ll populate the two tables and we can see what the file layout looks like afterward:

INSERT dbo.SampleCustomersInt DEFAULT VALUES;
INSERT dbo.SampleCustomersGuid DEFAULT VALUES;
GO 20000

Compare

After inserts, the resulting graphic does show some facts we know to be true:

VizDemo2_02

First, the data in the integer-clustered index takes about eight bands of the diagram, while storing the same data in a GUID clustered index has required about twelve bands of data pages. The database itself supports that impression with space allocation – it reports these figures:

VizDemo2_03

Part of the extra space required is the width of the key, but part of it is the empty space on each page resulting from page splits. If a page that needs a new row is too full, then half the rows from that page are moved to a net-new page, half left in place, and the new row added to one or the other of the resulting pages. Afterward, they are often both partly empty.

Second, the whole graphic in the GUID clustered index area is a dark blue that the visualizer uses to show fragmentation – in fact, the object is almost perfectly fragmented, with practically no contiguous pages at all. The sequence of pages in the leaf level of the index is still a linked list, as always, but it it is physically stored in essentially random order on disk.

Does Re-Indexing Help?

The next question is whether we can combat these problems by doing huge amounts of index maintenance – if we rewrite the GUID index, will that make it take less space, or make it more efficient? The answer is, “well, sort of, temporarily.”

First, re-indexing will put the table in “GUID” order. Whether that really helps or not is debatable, perhaps. It would enable read-ahead for the index, which is otherwise clobbered by the fragmentation. Having the table in “GUID” order might or might not be of any help to performance. Second, re-indexing will make the pages denser, or less dense, depending on the fill factor applied. For the sake of demonstration, let’s re-index with the default fill factor, because I think that happens a lot out in the world, and it may tell us something:

ALTER INDEX ALL ON dbo.SampleCustomersGuid REBUILD;

After re-indexing, this is a view just of the second file group with the GUID clustered table (note that I scrolled down in the display):

VizDemo2_04

The arrow shows where the data was moved from the old data pages into a new region of the file. And, sure enough, it’s not fragmented (note the lighter color) and it takes less space in the file.

That might sound good, but if this is a real database, inserts probably will continue. In the int clustered case, as we know, new data will be appended to the end of the page sequence, but in this case, new data will have to be inserted into most of the existing pages on disk. Those are all full now, and will have to be split 50/50 to create new pages for the new data, both the old and new pages will have to be written out, and the new pages by definition can’t be in index order with the existing pages.

INSERT dbo.SampleCustomersGuid DEFAULT VALUES;
GO 20000

What we get after more rows are added to the table is what a layperson might call a “hot mess:”

VizDemo2_05

Here everything is fragmented – back to that dark blue – even the pages we just re-indexed a moment ago, because they all split. The table has more than doubled in size, even though we just doubled the number of rows, because the individual pages contain less data.

Would appropriate fill factor be a workaround? In some measure, yes, but it really only combats the issue. The write activity on the table, even with a low fill factor, will still be higher as more existing pages have to be flushed at checkpoints. The pages will still be less dense, and therefore take up more space on disk and in cache. In short – maybe helpful but no silver bullet.

What about Sequential GUIDs? Here I will venture my opinion. Sequential GUIDs have never made sense to me. They solve one part of this problem – the distributed insert part – but at the expense of the very things GUIDs might be good for, namely not demanding a visit to the database to generate an identifier. If you have to come to the database, you already lost this whole argument. Use an integer and solve the rest of the problem at the same time. I can only see it as a sort of band-aid for existing systems that could not be refactored, but, like a bad SUV that combines the worst properties of a car and a truck, it feels like a really poor compromise to me.

I hope this helps to illustrate some of the physical database design challenges that surround the use of GUID cluster keys. In the next installment I’m planning to demonstrate the interleaving of objects, which is one argument for multiple file groups.

Published Monday, January 28, 2013 10:45 PM by merrillaldrich

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Neeraj Mittal said:

Nice explanation of complex topic

January 29, 2013 8:07 AM
 

csm said:

Great tool for academic purposes!! Nothing better than a graphic to explain complex topics.

Hope that you can share this tool with us.

January 29, 2013 12:27 PM
 

merrillaldrich said:

Thanks guys. The public version is in the works.

January 29, 2013 10:28 PM
 

rivantsov said:

well, nice post but... sort of accepted truth for an experienced data guy. We use Guids for PK exclusively, but knowing the effect of random clustering, we explicitly specify clustered index and it starts with some natural sequential field like "CreateDateTime". So the immediate response to your concerns is "Guids as PK are OK, just make sure you add sane clustered index". This seems like a complete solution, but... I myself have some concerns - it does not seem to always work well.

Example: two tables: Order, OrderLine; parent/child, both use Guids for PK, and OrderLine references Order thru OrderId (Guid). Good practice is to base clustered index of OrderLine on OrderId - so that all lines for an order will be together, and will improve perf when we query for order with lines. But now we have this Guid as clustered index again - it means that chunks of lines will be allocated on disk with fragmentation as you describe. It will be interesting to find a solution... I'm thinking of even maybe adding artificial 'clustering' auto-inc int column that is base of clustered index and is propagated from parent to child - so child clusters as well. any ideas?

And abandoning Guids is not an option - our app does a lot of sync-ing of databases, and autoinc/identify for PK is not an option.

Another interesting question - what's the impact of Guid randomness on 'index' performance?

February 3, 2013 1:17 AM
 

rivantsov said:

oh, sorry, I misunderstood. you're demoing the tool... it's not about guids/pk per se

February 3, 2013 1:44 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement