SQLblog.com: The SQL Server Blog Spot on the Web

Merrill Aldrich

  • Quick and Dirty PowerShell SQL Server Load Test

    Today I wanted to share a very short PowerShell script that can be used to simulate running multiple clients against a SQL Server, for simple performance or hardware testing. One thing I love about PowerShell is that so many functions and features are pre-built and easily accessible that I have to write very little code to accomplish a lot. This script uses the background job feature to run asynchronous processes, and the SQL features to invoke scripts stored in files, and I didn’t have to engineer any of that, which is cool. It is out-of-the-box PowerShell 2 and SQL Server 2012 at the client, but would work against basically any version of SQL Server at the server side. Not bleeding edge at all.

    # Start multiple clients running the same SQL script simultaneously
    
    $numWorkers = 10
    
    $startTime = ( Get-Date ).AddMinutes(1)
    
    $jobscript = {
        param( $startTime )
             
        while ( ( Get-Date ) -lt $startTime ) { Start-Sleep -Milliseconds 100 }
        
        Invoke-Sqlcmd -InputFile 'path to your\sqlscript.sql' -ServerInstance yourSQLServer -Database yourDB -QueryTimeout 0  
    }
    
    (1..$numWorkers) | foreach {
        Start-Job $jobscript -ArgumentList $startTime -InitializationScript { Import-Module "sqlps" -DisableNameChecking }
    }

    The script does just one thing: given a SQL script stored in a file, it will start some number of simultaneous connections running the SQL against a server. Let’s break it down a bit to see how.

    First we establish the number of client connections to use, and identify a single start time for them to begin running, one minute from the current time. This extra minute will give all the threads time to get created/configured so that later we can cause them to start at approximately the same moment:

    $numWorkers = 10
    
    $startTime = ( Get-Date ).AddMinutes(1)

    Next, we need a “mini PowerShell script” inside this script to pass to an async job to run. This “mini script” is actually what the multiple worker threads execute. There are a few ways to make this, but a simple way to handle it is just to assign a code block { } to a $variable:

    $jobscript = { work for the async job goes here }

    The background job feature has some behavior that is not completely intuitive – for one thing, variables in your calling script won’t be accessible from a running job, because the job runs in a completely separate session/environment. For that reason, you have to explicitly pass in any values that you need in the code that the job runs. This is the reason for the first line inside the job script:

    param( $startTime )

    Using this parameter, we can pass the single start time value from the outer script to all the background jobs. Without it the jobs could not “see” the value.

    Next, the body of the job script is simply “wait for the start time, then execute this SQL script”:

        while ( ( Get-Date ) -lt $startTime ) { Start-Sleep -Milliseconds 100 }
        
        Invoke-Sqlcmd -InputFile 'path to your\sqlscript.sql' -ServerInstance yourSQLServer -Database yourDB -QueryTimeout 0  

    Note: Invoke-Sqlcmd is not loved by everyone, and it has some limitations, but it’s there and mostly works. If you are on pre-2012 SQL client tools, beware the bug with -QueryTimeout 0 and long-running queries.

    That is all the code we need to run within the background jobs.

    Lastly, we need to spin those background jobs up. This funny thing generates a list of numbers, 1 to 10:

    (1..$numWorkers)

    That list is piped to a foreach loop, which just drives the loop to run 10 times. For each iteration of the loop, we create a background job set to run the code in $jobscript:

        Start-Job $jobscript -ArgumentList $startTime -InitializationScript { Import-Module "sqlps" -DisableNameChecking }

    This statement again shows some of the special behavior of background jobs – first, you have to explicitly pass in any values using the -ArgumentList parameter, because the scope where the job runs cannot “see” your other code or variables. This is how to get the startTime value into all the jobs.

    Second, the background job won’t have your PowerShell profile. In my profile, I load the sqlps module, which is where Invoke-Sqlcmd lives. Since my profile doesn’t load for the job, the job would not have that module available, and would not know what “Invoke-Sqlcmd” is. Fortunately, the Start-Job cmdlet has an -InitializationScript option that allows you to pass a second code block that runs before the real work in the job. This is a good place to set up or configure the job, in this case loading the required module.

    The last item you’ll need for this is the actual SQL script that has your testing logic, like running resource-intensive queries in a loop. I will leave that as an exercise for the reader.
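
    If you just want a starting point, a throwaway script something like the following can generate scan-heavy read load. This is only a sketch: dbo.SomeLargeTable and SomeColumn are hypothetical placeholders, so point it at a suitably large table in your own test database and adjust the loop count.

    -- Throwaway load-generation sketch. dbo.SomeLargeTable and SomeColumn are
    -- hypothetical names; substitute a large table in your test database.
    SET NOCOUNT ON;
    
    DECLARE @i int = 0;
    
    WHILE @i < 1000
    BEGIN
        -- Deliberately non-sargable predicate, to force a scan and generate read IO
        SELECT COUNT(*)
        FROM dbo.SomeLargeTable
        WHERE SomeColumn LIKE '%' + CONVERT(char(36), NEWID()) + '%';
    
        SET @i += 1;
    END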

    Once the jobs are running you can obviously watch PerfMon or use other diagnostics to see how the subject server is handling the load. Back in PowerShell, you can use the cmdlets Get-Job, Receive-Job, Stop-Job and Remove-Job to manage or view the jobs created above.

    I used this to test a new server yesterday with a PCI flash storage card and drove it to 163,000 real, SQL Server, random read IOs/sec, which was pretty cool. Happy load testing!

  • File Layout Viewer vs. Drop Clustered Index

    I had a very educational exchange the other day on Twitter (also known as “getting schooled”) by my SQL compatriots Argenis Fernandez (@DBArgenis) and Hugo Kornelis (@Hugo_Kornelis). A guy called Chris Switalski (@crswit) asked a really interesting question on #sqlhelp, which I proceeded to answer incorrectly, which led to a correction by my friends, which in turn took me back into the SQL File Layout Viewer, which quickly uncovered a bug there.

    So, learn from your friends on Twitter!

    Anyway, here’s the question, and corrected demo using an updated SQL File Layout Viewer build.

    Question (paraphrasing):

    When you create a clustered index on a heap, the data from the heap is rewritten and ordered, and the upper levels of the index are added. If you drop that clustered index, what happens? Does the set of leaf-level pages stay in place, and the index pages just get unlinked and deallocated? Or is the data rewritten again? Does it stay in the same logical or physical order, even though the ordering requirement is removed?

    I had said that I imagine the data is rewritten in both cases, but that we could look and see to be sure. I was totally wrong – the leaf level of a clustered index is made of data pages, and a heap is made of data pages too, as Argenis pointed out. It turns out that, since they are not different structures down at that level, in order to drop a clustered index, the server will indeed just remove the upper levels of the index and leave all the leaf level data pages in place. Cool optimization! It turns out that my original release of the File Layout Viewer doesn’t always display this situation correctly, so I had to make some changes. This demo, though, I believe shows the behavior correctly.

    Answer:

    First, we make a test database and create a heap, and populate it with some dummy data:

    USE master
    IF EXISTS ( SELECT 1 FROM sys.databases WHERE name = 'HeapTest' ) DROP DATABASE HeapTest;
    CREATE DATABASE HeapTest;
    GO
    
    ALTER DATABASE HeapTest SET RECOVERY SIMPLE;
    GO
    ALTER DATABASE HeapTest MODIFY FILE ( NAME = N'HeapTest', SIZE = 102400KB );
    GO
    ALTER DATABASE HeapTest MODIFY FILE ( NAME = N'HeapTest_log', SIZE = 10240KB );
    GO
    
    USE HeapTest;
    
    CREATE TABLE SampleData ( 
        Data uniqueidentifier DEFAULT NEWID(), 
        Padding char(100) DEFAULT ( CAST( CHECKSUM( NEWID() ) as CHAR(100) ) ) 
    );
    GO
    
    INSERT SampleData DEFAULT VALUES;
    GO 100000

    If we run the File Layout Viewer, we see a nice, contiguous set of data pages (type 1) in teal:

    [Image: FLVDropClust01]

    Next, we’ll add a NON clustered index, just so we can track what happens to that structure too:

    CREATE NONCLUSTERED INDEX Nonclust ON SampleData ( Padding );
    

    That shows up in the next part of the data file, in bright blue:

    [Image: FLVDropClust02]

    Now if we make a clustered index on the heap, it will have to be rewritten into another region of the file, so that it can be stored in the right logical order. We should expect it to “move” in the visualization down into the empty part of the file. The non clustered index will also be rewritten, because it will switch from using physical row locations to point to data in the heap to using the clustering key to point to data in the new clustered index. Let’s see what it looks like:

    CREATE CLUSTERED INDEX Clust ON SampleData ( Data );
    

    The new clustered index and new version of the NON clustered index are both created in another region of the data file:

    [Image: FLVDropClust03]

    You can see, in lighter gray, the regions that were emptied where the two structures had been stored, and the two new structures now in the center of the graphic.

    Now for the part of the question I got wrong: what happens when we drop that clustered index?

    DROP INDEX SampleData.Clust;
    

    This is really neat: the leaf level of the clustered index (remember data pages are data pages) stays right where it is, and the upper level index pages are just deallocated/unlinked.
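
    If you want to verify this without the viewer, one rough spot-check is to record where a handful of rows live before the drop and compare afterward. The sketch below (run it in place of the plain DROP INDEX above) leans on the undocumented %%physloc%% virtual column and sys.fn_PhysLocFormatter, so treat it as illustrative only:

    -- Capture file:page:slot for a sample of rows while the clustered index exists
    IF OBJECT_ID('tempdb..#before') IS NOT NULL DROP TABLE #before;
    
    SELECT TOP (100) Data, sys.fn_PhysLocFormatter(%%physloc%%) AS LocBefore
    INTO #before
    FROM SampleData;
    
    DROP INDEX SampleData.Clust;
    
    -- The same rows should report the same physical locations after the drop
    SELECT b.LocBefore, sys.fn_PhysLocFormatter(%%physloc%%) AS LocAfter
    FROM SampleData AS s
    JOIN #before AS b ON b.Data = s.Data;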

    The NON clustered index does have to be rebuilt yet again to change those clustered index keys out for row location pointers that point to the rows in the heap, so those pages do get rewritten into yet another region of the file:

    [Image: FLVDropClust04]

    So, to summarize:

    When you add or remove a clustered index from a table, any NON clustered indexes always have to be rebuilt - and will be automatically - to exchange row pointers for clustering keys in the non clustered index itself. This generally takes care of itself, but it can be a performance issue, or cause longer running transactions or a large amount of logging.

    When you CREATE a clustered index on a heap, the data has to be rewritten to enforce the logical order of the index.

    When you DROP a clustered index, however, the server can keep all the leaf level pages in place and just strip away the index levels above.

  • Telecommuting, Month 9

    Five Things I’ve Learned About Surviving the Telecommute

    Over the past several years my wife and I went through this strange series of events where we swapped work roles – first she worked remotely for a Minnesota company, from our home in Seattle, and then we moved to Saint Paul, Minnesota and I started working remotely for my company back in Seattle. Our story is a lot like what you read about remote work, but it’s different to live it.

    Backstory

    We lived on Vashon Island in Puget Sound, near Seattle, for about eight years. The small company my wife works for relocated from there to Minnesota, but she kept her job, working remotely as part of what became a virtual corporation. They were a very small company, but tech-savvy enough that, while not an IT business, remote work seemed like a natural arrangement. I think that shows just how far that idea has come. Like many U.S. families, we video conference with the Grandmas and Grandpas spread around the country, which she effortlessly applied to work, too. Her little company used “cloud” services without really thinking about that idea as anything new or special. It was just a natural fit. They valued her, she valued them, the tech was pretty easy. It was all win.

    Fast forward a few years and our family was looking to relocate, too. Her company was not really the destination – we needed mainly to be closer to our extended families. “Closer” today often means flight time and number of connections (a.k.a. airport) more than physical proximity. Our families are spread all over the eastern half of the U.S. so moving to one or the other city was something we considered, but wasn’t really the only solution. We have kids, but we are also city people and we both really wanted the culture and amenities of a real city and a good, safe urban environment for our boys. Not too many cities have that combination.

    “How the heck did you pick Saint Paul, MN?” is a question I get all the time. It’s hard to explain, but take the airport, proximity to East Coast, South and Northwest by one direct flight, the truly outstanding elementary school in our new neighborhood, the old Victorian gridiron neighborhoods of Summit Hill/Macalester/Groveland, a good friend from college, no fear of snow, and toss them all together in a geography salad, and that was our destination.

    By freak accident, or fate, it’s also not far from my wife’s workplace.

    Switcheroo

    We moved and we both kept our jobs, which was amazing. My wife is able to go into the office a few days a week now, though it’s a long drive, and I started telecommuting to my job in Seattle from Saint Paul.

    Both of us have experienced the usual pros and cons of telecommuting. By now some of these things are well documented, but I think it’s hard to see how important those issues are until you experience them firsthand, either as a remote employee or working with a remote employee, and over a long period. It’s great, and it’s difficult at the same time.

    I will say my employer has been really wonderful throughout this process – they are supportive, and they had to go through extra effort to make this happen for me, which they certainly didn’t have to do. We had some tentative few-days-a-week sort of telecommuting for a while, but I think I am the first true full time, long distance, different-time-zone sort of remote worker, with all the challenges that come along in that scenario. Because there wasn’t really an established environment used to remote work, with habits and technology and processes in place, we’ve had to work some things out as we go.

    In fact every aspect of it, for me, has proven a double-edged sword, and I wanted to spend some time talking about why.

    Coffee Shop? No.

    First I have to get one myth out of the way: working in a coffee shop is OK -- if you want to work in an uncomfortable, noisy place where you can’t use the phone, the network is dodgy, your screen’s too small, coffee costs $4, and there’s no privacy. It only sounds romantic. Honestly.

    I work from a dedicated space in my house in order to be focused and minimize distractions, and that has worked well. I’d say to even contemplate working from home you need a separate space, with a door you can close. The attic space where I currently work is great in terms of creating a separate work environment in the house. I have a real desk and an IP phone connected straight to the office. The space has some other issues, notably not being a layout where office furnishings, network and power work well, but we are moving to a permanent home a few blocks away, and in purchasing that house the “office space” was one of the main purchase criteria. We should be into the new place in late June.

    On the other hand, working alone in a home office has some real social challenges, as I’ll talk about more below. It’s quiet. Really, really quiet, which can be too much of a good thing.

    Losing the Commute

    I have always – so far, anyway – avoided a typical driving commute. Spending hours on the freeway in a car is so awful, for me, I have always orchestrated living and working so as to avoid it, by living close to my office or to transit. Eliminating the commute altogether is obviously a fantastic perk of telecommuting. I am eternally grateful for the hours I’ve gotten back.

    And yet.

    Here’s the strange thing, which others have talked about too: if you are the type of person for whom remote work will be successful, you probably are the type who has a passion for work, where work isn’t just time spent from eight to five. Passion for work is something that creates the discipline to work consistently without explicit direction, and to produce all on your own, regardless of schedule.

    Those same traits also make the removal of the commute problematic. Life and work can start to merge together, and it isn’t always pretty. Leaving the house, getting in the car/bus/train/ferry and being forced to sit there, even for twenty minutes, is an important psychological boundary between work and life, where you – even subconsciously – leave one and get mentally prepared for the other.

    This is something I really struggle with. I have a tendency to obsess about work anyway, and when it’s always right here, that can be a real problem. It’s particularly bad during the times when work is stressful. Working from home demands that your team evaluate you on your output as much as just your presence, but your perceived presence looks irregular, and that can cut both ways.

    Isolation

    There are huge benefits to being out of the daily fray in the office, mostly in terms of productivity on heads-down work. Interruptions are minimal. I used to use the “Pomodoro” technique in the office, working in 20-25 minute sprints, but it was mostly as a survival technique against the onslaught of meetings, and email, and walk-up requests. I find in my new situation that I don’t have to do that, because finding solid blocks of time to do real work isn’t so much of a problem.

    The other refreshing thing is that one important type of work – thought – is OK. I used to be in a cube, in an open office. One challenge was that some parts of my job require me just to think. Think about strategy. Think about how to design against some complex problem. Read books. When you think, you don’t necessarily look busy. It’s productive time, but you aren’t producing a thing someone can point at. So, time to think through problems without some pretense of tapping at the keyboard is really valuable. In the office I used to block time out and walk on a nearby bike path in order to have this time without needing to look “busy.” Now I can do this in the house, which is great.

    Communication with my team really is a challenge. I have three-times-a-week calls with my team lead, and daily short calls with my team just to check in. It sounds odd, but these are a real requirement just to help offset the lack of hallway conversation that normally takes place in the office. If you are thinking of working remotely, I would advise making the extra effort to schedule time like this, because even though it sounds trivial, it’s really important to keep everyone on the same page.

    The biggest challenge of all is missing out on the rest of the hallway or cube/desk decision-making that can go on. I always had my ear to the ground in the office, and viewed part of my DBA role as trying to steer everyone to the right decisions about data and SQL Server, even when it meant heading something off that I overheard in the next cube. For better or worse, that part of my job isn’t possible remotely, and that can be a challenge.

    But I’ll also lay this right out there: it’s lonely. At different times in life we have different social structures to rely on for friendships. There’s high school, college, church -- before kids my wife and I used to have dinner with friends or have the occasional party. If you’re 40+, with kids, work is a main source of social interaction and friends. There aren’t that many other social structures like it. Moving to a new city and essentially breaking that social interaction with my coworkers is tough, and there isn’t much to replace it.

    Social media (and I’m hooked on it) doesn’t fill that void, I’m afraid. It naturally took over more of my free time with this move, but as months go by I realize there’s something depressing about it, and I need to limit the time I spend online.

    Children

    Some things about remote work have helped our family life, but mostly it relates to basics like flexible hours and the time recovered from not spending an hour a day on the road. If you think that working remotely will help with child care expenses, I’d say stop – that’s unlikely. It’s not possible, for me anyway, to actually care for the kids in any reasonable way and work at the same time. There’s no way. I think I could do a poor job with work and a poor job watching them, and that serves nobody well. We have just as many childcare arrangements and expenses as if I drove to work.

    But I do love the little perks. Many mornings I get to eat breakfast with the kids and put the oldest one, who is in second grade, on the bus. My wife and I have lunch dates, usually once a week, on a day we both work from home and the kids are at school.

    On the other hand, there are times when I have to explicitly stop my wife or my kids and just say, “I really can’t do x because I am working. Truly.” And it is easier for them, just because I’m in the house, to ask or stop me, “Honey can you stop and …” Having my wife have work-from-home experience too really helps, but this can be a problem (both directions, I am guilty too).

    For better or worse, when we really have an issue with the family, which sometimes comes down to just not having enough hours in the day, my wife or I can work in the evening to make up time or meet deadlines, and that is definitely a major benefit. It can, though, aggravate the whole problem of not leaving work at work and being free of it at home.

    Tech, Tools and Email

    This is the part all us gadget freaks gravitate to, I think – what tools make work-from-home possible, or make it better. Here the news is also mixed.

    First, I. Love. Email. To a fault.

    Imagine a medium where multiple people can collaborate on a solution and everyone is informed, where the outcome and the trail of decisions is documented in writing automatically, and the participants can all work on it asynchronously, without being interrupted. In a perfect world, it seems like email would almost do that. (I dislike the phone for all the inverse reasons.)

    And yet.

    Working remotely makes it so easy to fall into an email-only mode of communication, and while I love that, it flatly doesn’t work well for some types of activity. For one, I find it almost impossible to persuade some individuals to do something through email. To act. I am very sorry to report that that still often requires a phone call or, ack!, a full-on meeting, with a manager, because email simply goes ignored. Among those that like email, email works, but with others, sometimes not. Fortunately my team lead understands this well, and I know I can ask him to set up that meeting or that call when we have to go that route. Email can also foster miscommunication in some circumstances, especially relating to tone and personal interaction. This email thing is clearly a habit I have to be careful about.

    What about all those other tools? Tele-presence and all the funnily-named collaboration do-dads like Skype, Trello, Dropbox, Toggl and Producteev and Flow and Remember the Milk and Google Hangouts and Asana?

    There is, unfortunately, a sobering reality here. The number and variety of online collaboration tools really has exploded recently with the growth in remote workforce and virtual organizations. But it’s been my experience that a company probably has to be young (in attitude, not necessarily age), nimble, small, engaged in ideas about the modern workforce, willing to try new things, and dealing in few secrets before it will contemplate using most of these services. Otherwise the obstacles are many: change, security, habit, ignorance, organizational inertia, isolation, policies, firewalls both real and imagined.

    If you are in a company that is bigger, a company with security concerns, a company that cannot change course quickly, then it seems likely you might get the right to work from home but not get access to any of these cool new tools.

    So, I have to say, in my case I have Lync (barely) for desk or app-sharing, and an IT group that is just now getting the hang of that, but no video, no tele-presence, no novel SaaS online project management whiz-bang anything, funny names or no. And I suspect that the work-from-home crowd may fall into two groups – those with the new tools in small companies or start-ups, and those, like me, in established companies trying to make it work with just RDP and the phone.

    I don’t think this is related to my company’s commitment to my success working remotely, or to their support of it – it’s simply too hard to overcome the organizational inertia, which will be true in many organizations.

    But with some good processes and habits, I think the minimal-tools predicament can still work. Process and habit, I believe, are probably 85% of the battle, and while coolio tools might aid that, I don’t imagine they are a substitute for it.

    I do get jealous of my wife on this front, though, because a small company on a shoestring budget can do more with cloud-based applications like this than a bigger, well-capitalized one. Ironic.

    Conclusions?

    I think of this as an ongoing experiment. I love my job as a DBA – though, to be sure, every job comes with some challenges – and in many ways it’s suited to remote work. I am finding it harder than I imagined, mostly because it’s such a solitary existence, and because the lines between work and home are so blurred as to be almost erased. I have to keep changing techniques and trying different things to make it work well.

    It’s become clear that I should not wait for the next “teh aw3some” tool to come around, and instead focus on basics like process, communication, and habit to make the fundamental tools of RDP, phone and airplane work to their fullest.

     

    P.S. I am grateful to SQL community friends Tom LaRock, Mike Walsh, and Aaron Bertrand for part of this remote work adventure.

    Edit 23 April: A friend of mine reminded me how important it is to go back into the office and talk with people face to face. I thought I should add, since I didn't make this clear, that I do go back for three days or so every other month. That's vital to this whole arrangement. I try to pack those visits with meetings just so I can talk with people and get reacquainted. 

     

  • Speaking in Chicago Saturday

    I’m excited to be offering a beginning PowerShell session at SQL Saturday #211 in Chicago on Saturday, April 13. This time we’re making it a family weekend, bringing our two boys. I haven’t been to Chicago for many years, and it’s, of course, an Architecture dream world, so it should be fun to look at some buildings again!

    [Image: Emil_Bach_House]

    Emil Bach House, photo © Jeremy Atherton, 2006

    The session is intended to help you get started with some PowerShell syntax basics. I try to demystify the Pipeline a bit, help make sense out of all the { curlies } and [ brackets ] and ( parentheses ), and translate some of what you already know about T-SQL into PowerShell. If PowerShell seems cryptic or mysterious, come on down and maybe we can sort it out together.

    Should be a great time.

  • SSMS 2012 Restore GUI Gotcha

    Today I want to bring to your attention an issue in the SQL Server Management Studio 2012 restore GUI. In many ways the new restore dialog is nicer than the old one, with new features and added convenience – but, as is always the Achilles heel of GUI tools like this, if you don’t know what’s really going on it can bite you. I’m not sure what to call this issue, maybe just a UI design flaw. Technically it works as designed, and there’s nothing really wrong with it, so it’s not a bug. But I can imagine it really causing someone pain who is careless or doesn’t know what’s happening behind the scenes.

    Restoring a copy of a database may take the original down.

    Among the new features is some management of “Tail-Log Backups,” which is a wonderful idea. Taking a backup of the tail of the log will take down the database and put the whole end of the log into a backup file, preserving basically all modifications to the DB from the log. This is ideal, for example, at the last step of migrating a database using log shipping, because all the activity is preserved and the source database is locked against further modifications. The log chain is preserved in moving to the new copy of the database, or in a DR scenario, to your other server.

    The problem is, I know a lot of people don’t know this. Often it’s people who use the GUI a lot, such as “accidental DBAs.” I think the GUI is really important, despite those who rarely use it or frown on it.

    Here’s the issue: Suppose I have a production database that has some user-created problem (like a user accidentally updating some rows). It may be reasonable to restore a copy of the database to a point before the issue and investigate whether it’s possible to merge the data back in. The original database isn’t damaged from a technical point of view – there’s no corruption, for example.

    1. Right-click the original source database and choose Tasks > Restore > Database … from the context menu.
    2. The resulting dialog will go out and discover the backup files for that database from MSDB, which is very helpful.
    3. Change the name of the target database, in order to restore it as a copy.

    [Image: RestoreGUIIssueCap1]

    There’s a note at the top of the dialog that, while accurate, may not help some in this scenario. It says, “A tail-log backup of the source database will be taken.” Sounds innocuous if you don’t know what that implies. In past versions, restoring a copy of a database would never affect the original – or not by default, anyway – so I cringe as I imagine people merrily clicking past this warning.

    The script you get with these settings is indeed composed with a tail-log backup of the source database:

    [Image: RestoreGUIIssueCap2]

    That will, as indicated, take a tail-log backup – and thereby take the original database down. Nice if you wanted that, but a mean surprise if you didn’t.
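
    If you don’t have the screenshot handy, the generated script is shaped roughly like this. This is a paraphrased sketch with hypothetical database names and file paths, not the exact SSMS output:

    -- The tail-log backup is the step that takes the source database offline;
    -- WITH NORECOVERY leaves MyProdDB in the RESTORING state.
    BACKUP LOG [MyProdDB]
        TO DISK = N'C:\Backups\MyProdDB_LogBackup.trn'
        WITH NOFORMAT, NOINIT, NORECOVERY;
    
    -- The copy is then restored under the new name from the discovered backups
    RESTORE DATABASE [MyProdDB_Copy]
        FROM DISK = N'C:\Backups\MyProdDB_Full.bak'
        WITH MOVE N'MyProdDB' TO N'C:\Data\MyProdDB_Copy.mdf',
             MOVE N'MyProdDB_log' TO N'C:\Data\MyProdDB_Copy_log.ldf',
             RECOVERY;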

    If you act on the warning and click the Options tab, and then uncheck the offending setting, you do get the expected behavior (the original database is unaffected because the tail-log backup is not included at the top of the script):

    [Image: RestoreGUIIssueCap3]

    So, be careful out there!

  • T-SQL Tuesday #040: Files, Filegroups and Visualizing Interleaved Objects

    Early in my career as a DBA, I have to admit I didn’t quite “get” what all the fuss was about with the multiple file and multiple filegroup capability in SQL Server. Over the years, though, as I learned more about partitioning, backup strategies for large databases and, most importantly, storage internals I’ve started to catch on. For today’s T-SQL Tuesday, I thought I would share an early lesson, using the newly released SQL File Layout Viewer utility available here.

    [Image: tsql2sday]

    Adam Machanic started T-SQL Tuesday several years ago and it’s been a great way to motivate bloggers and readers to share ideas around a topic. This month’s event is hosted by Jen McCown / MidnightDBA. An awesome idea, as ever.

    What I hope to illustrate today is a simple but vital concept about files and file groups: files, on their own, use a proportional fill algorithm in SQL Server and will receive new data from any object that needs more space. If an object needs additional storage, an allocation will be made in an empty area of whichever file has the most space available. Under the multi-file feature alone, there is no attempt whatsoever to separate objects in the database logically into different files. The result is that objects can appear to “round robin” through the files as more space is allocated – it’s not really a round-robin algorithm, but if the files are all about equally full, it looks similar.

    Filegroups, on the other hand, allow logically grouping objects in a way that specifically directs them into separate files, and so controls the physical layout of those objects on disk. Using filegroups it’s possible to direct an object into a file (or set of files) and physically separate it from the rest of the database.

    Because of this distinction, it’s been my experience that filegroups are far and away the more useful of the two features, and if I have a database that merits splitting up, I go straight for filegroups, with one physical file per group. (There are some edge case exceptions like TempDB allocation contention that are outside the scope of this post.)
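
    As an aside, if you ever want to confirm which filegroup an existing table or index actually landed in, a quick catalog query will show the mapping. This is just a sketch using documented views; it can be run in the demo database built below:

    -- Map each user table and index to the filegroup (data space) it is stored on.
    -- For partitioned objects, data_space_id refers to the partition scheme instead.
    SELECT OBJECT_NAME(i.object_id)      AS TableName,
           ISNULL(i.name, '(heap)')      AS IndexName,
           ds.name                       AS FilegroupName
    FROM sys.indexes AS i
    JOIN sys.data_spaces AS ds
        ON ds.data_space_id = i.data_space_id
    WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    ORDER BY TableName, IndexName;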

    Interleaving

    There are a few reasons we’d want to physically separate objects, but, because it so clearly illustrates the difference, I want to show interleaving today. I made a demo database using the following script:

    USE [master]
    GO
    
    :setvar datapath "C:\Program Files\Microsoft SQL Server\MSSQL10_50.SQL2008R2\MSSQL\DATA"
    :setvar logpath "C:\Program Files\Microsoft SQL Server\MSSQL10_50.SQL2008R2\MSSQL\DATA"
    
    CREATE DATABASE [VizDemo3] ON  PRIMARY 
    ( 
        NAME = N'VizDemo3', FILENAME = N'$(datapath)\VizDemo3.mdf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    FILEGROUP [MultiFileFG] 
    ( 
        NAME = N'VizDemoMFGF1', FILENAME = N'$(datapath)\VizDemoMFGF1.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    ( 
        NAME = N'VizDemoMFGF2', FILENAME = N'$(datapath)\VizDemoMFGF2.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    ( 
        NAME = N'VizDemoMFGF3', FILENAME = N'$(datapath)\VizDemoMFGF3.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    FILEGROUP [SingleFileFG1] 
    ( 
        NAME = N'VizDemoSFG1', FILENAME = N'$(datapath)\VizDemoSFG1.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    FILEGROUP [SingleFileFG2] 
    ( 
        NAME = N'VizDemoSFG2', FILENAME = N'$(datapath)\VizDemoSFG2.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    )
    LOG ON 
    ( 
        NAME = N'VizDemo3_log', FILENAME = N'$(logpath)\VizDemo3_log.ldf' , 
        SIZE = 10240KB , MAXSIZE = 2048GB , FILEGROWTH = 10240KB 
    )
    GO
    
    ALTER DATABASE [VizDemo3] SET RECOVERY SIMPLE 
    GO
    
    USE VizDemo3
    GO
    
    CREATE TABLE dbo.SampleCustomers_on_Primary  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON [PRIMARY] ;
    GO
    
    CREATE TABLE dbo.SampleOrders_on_Primary  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON [PRIMARY] ;
    GO
    
    INSERT dbo.SampleCustomers_on_Primary DEFAULT VALUES;
    INSERT dbo.SampleOrders_on_Primary DEFAULT VALUES;
    GO 20000
    
    
    CREATE TABLE dbo.SampleCustomers_on_Files  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON MultiFileFG ;
    GO
    
    CREATE TABLE dbo.SampleOrders_on_Files  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON MultiFileFG ;
    GO
    
    INSERT dbo.SampleCustomers_on_Files DEFAULT VALUES;
    INSERT dbo.SampleOrders_on_Files DEFAULT VALUES;
    GO 20000
    
    
    CREATE TABLE dbo.SampleCustomers_on_FileGroups  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON SingleFileFG1 ;
    GO
    
    CREATE TABLE dbo.SampleOrders_on_FileGroups  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON SingleFileFG2 ;
    GO
    
    INSERT dbo.SampleCustomers_on_FileGroups DEFAULT VALUES;
    INSERT dbo.SampleOrders_on_FileGroups DEFAULT VALUES;
    GO 20000

    This is a really simple example showing several possible file layouts, and what those do with the physical storage of the data. Starting from the top, note that the database create statement makes three different file setups:

    1. Our old friend Primary – one MDF file to rule them all!
    2. A file group with three physical files: MultiFileFG. These files will be populated with SQL Server’s proportional fill.
    3. Two file groups with one physical file apiece: SingleFileFG1 and SingleFileFG2. This allows the DBA to direct objects into specific files on disk.

    Next I make some pairs of sample tables, and locate a pair in Primary, a pair in MultiFileFG, and a pair (one table each) in SingleFileFG1 and 2. The tables are each populated with 20,000 rows of sample data, arbitrarily called Customers and Orders.

    It’s important to note that the tables are filled with data in an alternating fashion – one table gets a row, then the other table gets a row – and that means that the allocations for space for those objects will also alternate. One table grows, then the other, then the first again.

    If the tables are in the same filegroup, even with separate physical files, the data will be interleaved. There will be a block of pages allocated for one table, then a block for the other. Under default settings in SQL Server (not using the -E startup parameter, for example) the blocks of pages will each be a 64 KB extent of 8 data pages.

    So, let’s have a look at what happens inside the files, and see if it supports this logic.

    One File

    [Image: InterleaveScreenCap1]

    My goodness it does! But you probably suspected that, as I’m the guy writing this post.

    First let’s look at the blue area – the stack-of-bricks looking light blue and purple region is the Primary filegroup / MDF file. What we are seeing is the alternating, interleaved extents from the two tables, one in blue and one in lavender/purple. This isn’t terrible, but there are two things worth noting:

    1. There’s a decent amount of fragmentation. Not harmful, necessarily, but these two tables definitely are not contiguous. If we have an application that wants to do a lot of scans, and we have spinning media, we might have some performance degradation. Read-ahead probably won’t work as well. This might not be ideal for a data warehouse.
    2. If we re-index one of those objects the other one will remain full of “holes.” We could re-index both, but it might take a few passes before this interleaving works itself out.

    I want to emphasize here that this isn’t necessarily a terrible situation, but it’s illuminating as far as how the data is physically written out.
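
    If you would rather put a number on the interleaving than eyeball it, sys.dm_db_index_physical_stats reports the resulting fragmentation and page fullness. A sketch, run inside the demo database, against the two tables stored in PRIMARY:

    -- Run inside VizDemo3: fragmentation of the two interleaved tables in PRIMARY
    SELECT OBJECT_NAME(ps.object_id)           AS TableName,
           ps.index_type_desc,
           ps.avg_fragmentation_in_percent,
           ps.avg_page_space_used_in_percent,
           ps.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ps
    WHERE OBJECT_NAME(ps.object_id) IN
          ('SampleCustomers_on_Primary', 'SampleOrders_on_Primary');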

    Three Files in a Group

    The next reasonable question to ask is whether making multiple files, alone, might solve this issue. So the next section of the visualizer, scrolling down a bit, shows the one filegroup/multiple file scenario from the demo. This is filegroup MultiFileFG, shown in alternating pink and purple:

    [Image: InterleaveScreenCap2]

    Notice, we got three files, but the two tables we map into them are still interleaved, now across all three files. (Each file in the visualizer has a signature of orange-gray-orange system pages right at the beginning.)

    This pattern is because of the proportional fill applied to the physical files inside one filegroup. An extent is allocated for an object from one file; then the next file has the most available space, so it receives the next extent, and so on. This means our tables are still cut up into 8-page extents spread across the physical media.

    Two Groups with One File Each

    The last section of the demo writes each table into a separate filegroup, where each filegroup is backed by a single physical file. The results here are represented by the solid pink area in the graphic:

    [Image: InterleaveScreenCap3]

    In this case, the alternating inserts in the script don’t result in alternating allocations inside the files, because one table is specifically isolated to one filegroup and the other table to the other. In this case, the data in the files is practically contiguous.

    Conclusion

    Does this mean you should go out and make 1,000 file groups, one for every table or index? Please don’t! There’s a balance when planning files and filegroups between manageability, the real (or imagined) performance impact of fragmentation, complexity, and many other factors. SQL Server actually does a fine job, on smaller databases, working with one or a handful of files. But if you are planning a larger warehouse project, or you find the hardware you have struggles with the volume of data you need to manage, or you are looking at partitioning, then understanding how these storage internals behave, along with many other design considerations, will help.

  • Presenting to East Iowa I-380 U.G. March 12

    Tomorrow I’m making the beautiful drive down from Minneapolis into Iowa to present to the I-380 / East Iowa SQL Server Users Group. I’ll be talking about managing lots and lots of unruly databases, and presenting a walkthrough of this maintenance solution. If you’re in the area please come check it out!

  • March Update to Rules-Driven Maintenance

    This month I have a minor update to the Rules-Driven Maintenance code I originally posted back in August 2012. This update has just two enhancements, but they are nice ones, I think:

    1. Much improved handling for instances that use database snapshots.
    2. Ability to do intra-day differential backups purely by setting maintenance times and limits in the policy table.

    The code posted here is cumulative, and replaces entirely the original code, but please refer back to that original blog post for description, instructions and details. This update can be installed right over an existing deployment, or be installed all by itself as a complete solution but – as always – please test and be cautious.

    Enjoy!

    Note: I've posted updates to this solution. Please get the latest version here.

  • Trigger Authoring Decision Tree

    [Image: TriggersDecisionTree]

  • Public Release, SQL Server File Layout Viewer

    Version 1.0 is Now Available!

    Edit 9 October 2013: A new version is out! Please download 1.2 from here. 

    I’ve been working off and on, as my real job permits, on this visualization tool for SQL Server data files. This is an educational or exploratory tool where you can more readily see how the individual data pages in MDF/NDF files are organized, where your tables and indexes live, and what effect operations like index rebuilds or reorganizes have on the physical layout of the data pages.

    [Image: FileLayoutViewerR1]

    The viewer will scan a whole database, using only SQL and DBCC commands, and will render a color-coded representation of all the data pages as colored bands. Each partition of each index or heap in the database is assigned a color, so that you can see where all the bits and pieces of an object are located in the files. Above the colored bands there are grayscale or white pixels that show the page type in SQL Server (most are white, which are data pages; unused/empty regions of the file show as gray). In the image above, for example, all the bright green areas are one index, all the purple areas are one index, and so on.

    There is mouse-over functionality. If you move the mouse cursor over the graph, then details about each page populate the text fields at right, including the object and index the page belongs to, the page type, whether the page represents a fragment, where the previous and next pages are for the same object, etc.

    Why?

    Why create something like this? I am a visual person, and I have a theory that many issues we have in computing come down to not being able to see what’s going on. This is especially true as we learn about unfamiliar technology – we have to develop a mental model of structures like B-trees or linked lists or files in order to understand what’s happening. I hope this tool, combined with other knowledge, will help people form an accurate understanding of how data file internals work in SQL Server, faster than working purely in the abstract with tools like DBCC Page or DBCC Ind.

    Instructions

    1. Download the tool and unzip it. The package includes both an executable and the source code. If you don’t want the source, the .exe file is a standalone program and will run all on its own, so you are welcome to discard the source folder.
    2. Validate you have the required prerequisites from the Prereq’s section below.
    3. Locate a non-production/test database to analyze. The database can be local or on a remote server. I suggest something of a reasonable size, because scanning a really huge data set can take quite a long time.
    4. Run SQLFileLayoutViewer.exe and select a database to scan. If the database is on a remote server, type the SQL Server name/instance name into the dialog.
    5. Click Analyze.
    6. Examine the resulting graph, and mouse over it with the cursor to view detailed information about each page.

    Disclaimer

    This is a freeware tool provided for your fun, education and entertainment. However, there is no warranty of any kind and you use it at your sole risk. The tool is free but offered under the GNU General Public License 3. If successful, and people are interested, I’ll move this work to some sort of open source project.

    Prerequisites

    The app requires .NET Framework 4.0 and the SQL Server management tools. I’ve tested it on Windows 7, Windows Server 2008 R2 and Windows 8. It can be run against a database on a local or remote SQL instance. I believe it will work on any database in SQL Server 2005 or later, but have not tested every possible scenario.

    Risks?

    I believe this tool to be relatively risk free, but I would avoid running it against live production data. The tool’s data collection is simple: it will issue a few system table selects to get things like object names, and then it will execute a DBCC PAGE statement against every page in the database. All other processing after that is done locally in the application itself. It does not modify the database.
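
    For anyone who wants to peek at a single page by hand, the same way the tool does, the call looks roughly like this. DBCC PAGE is undocumented, and the database name, file id and page id here are placeholders:

    -- Send DBCC output to the client session instead of the error log
    DBCC TRACEON(3604);
    
    -- DBCC PAGE ( <database>, <file id>, <page id>, <print option 0-3> )
    DBCC PAGE ('HeapTest', 1, 200, 3);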

    Bugs?

    I would love to hear about bugs you come across, or additional features you think would be valuable. Please contact me through this site. Note that I am a DBA first, and an amateur .NET developer a distant second, so please be gentle.

    Enjoy!

  • Presenting for PASSMN User Group Feb. 19

    I’m pleased to be presenting the talk “Shoveling Frogs into a Wheelbarrow” for my new hometown user group PASSMN on February 19, 2013. Come on down if you’re in the Twin Cities.

    This talk is close to my heart, as it shows how I grew from a younger, greener DBA responsible for one or two systems into a terrified, overwhelmed, more mature DBA facing 2,000+ databases alone. I hope this will be a highly interactive session, as I am sure there will be a mix of people in the audience with more or less experience than I have, and more or less complex environments to manage.

    <Ribbit.>

  • OT Le Casque Zik de Parrot Totally Biased Review (Zik Headphones)

    I’m not a complete gadget freak, but sometimes a widget draws me in and pushes that gadget-lust button. Such was the case last year when Philippe Starck and the French bluetooth device maker Parrot announced a collaboration to make what are probably the coolest designer headphones anywhere:

    http://www.journaldugeek.com/forum/topic/9825-news-un-prix-et-une-date-pour-le-casque-zik-de-parrot/

    I never thought I would own a pair, at $400, but my darling, awesome, rock star wife Denise bought me some for Christmas. I thought I should do some kind of objective review now that I have had them for a month, but I am so in love with these that that’s clearly impossible. So, instead, here’s my completely biased review/tour of Le Zik. I’m posting it here just because I hope my SQL peeps are also travellers and also love tech like this.

    This is the kit I travel with:

    [Image: P1090832]

    I have a Lenovo Thinkpad x220 tablet with Windows 8, top. It has a touch screen that you can twist and fold to use the device as a tablet. Size-wise it’s smaller than a laptop and bigger than a tablet. On a plane it is a thing of beauty – fully functional computer for work, tablet for reading, movies or music without carrying an extra device. Across the middle is what you get in the box with Parrot Zik. At the bottom is my no-frills, cheap Android phone.

    Zik looks like this:

    [Images: P1090835, P1090836, P1090837]

    The headphones feel very solidly constructed but not heavy. The headband is connected to the ear pieces with gorgeous sculpted aluminum “arms,” with orange accents at the point where they disappear into the top band. The aluminum pieces are articulated so that the unit folds flat to store or carry, and for automatic fit when they are on. I am a bit on the small side, but they fit very comfortably for several hours at a time. The only adjustment is that those arms extend out farther from the headband for bigger heads.

    The controls, ports and the mics for noise cancellation are in a coordinating aluminum inset on the bottom of the ear pieces:

    [Image: P1090838]

    Headphones are generally mundane, occasionally interesting from a design point of view, but Starck really manages to elevate them – these are really nice, well put together, comfortable – the whole thing works well, and is easy on the eyes while remaining simple and understated. All industrial design should be so good!

    What’s the Big Deal?

    Here are the features packed into these little guys:

    • Active noise cancellation
    • Bluetooth for wireless audio, software controls, and headset mic function
    • Wired function as a backup, or for older audio devices
    • World-class audio quality
    • Touch controls (!) right in the surface of the right earpiece that control volume, start, stop of media

    Essentially, these have the software and hardware to integrate with any bluetooth device you use for audio. Watch a movie wirelessly from my laptop? Done. Listen to music wirelessly from my phone? Done. I came from using basic wired ear buds (like I said, perhaps this isn’t exactly an objective review) so this feature set blew me away. There might be other headsets with the same software integration.

    When I got these, and figured out how to install the battery, which is a clever puzzle, I took them for an audio spin first. I installed the app on my phone, charged them over USB, and then did a tour of many types of music. I have strange taste in music, probably, so please be gentle:

    • Clapton, Joe Bonamassa and Jeff Beck gave me some sense of the guitar/rock/blues side
    • Some Swedish House Mafia for the “Oonse Oonse Oonse”-ability and bass extension
    • Jackson Browne’s two Solo Acoustic albums for intimate folk/vocals
    • Brandi Carlile Live at Benaroya Hall
    • Macklemore, which is about as close to Hip Hop as I get :-)
    • Yes, if you are into old-school virtuoso rock
    • Various Wynton Marsalis Jazz tunes

    Spotify, Pandora, iTunes and my phone’s music player all worked seamlessly. The app installed and worked without any drama – and for my laptop, no software was required at all. It paired with the headphones and they just worked. Everything sounded amazing.

    The touch controls work well and are intuitive – you swipe a finger up, down, or across the black surface of the right earphone to change volume, skip ahead, skip back or pause.

    The only strange thing was getting used to turning them off when I put them down, so the battery would not run while I didn’t have them on. I have since developed that habit. The on/off button is in a very convenient place and has white (on) and red (charging) backlight LEDs, and a tactile bump so you can hit it without looking. Great details.

    What about that Problem Using the Cable?

    I had read, before we bought these, that some people complained about poor audio quality when using them in wired mode instead of over bluetooth, and that, at $400, was a worry. Here’s the thing I think may have happened with those folks – though I am speculating: when I first plugged in the audio cable, at the headphones end, it was really stiff. I thought I had seated the jack all the way in, but I listened, and sure enough, a thin, tinny sound and practically nothing from one earphone. However, I took them off and gave that jack a good shove, and, click, there was one more notch. After that it sounded perfect.

    Here’s one thing about the wired-or-not setup:

    The headphones have audio enhancement software built in, like a digital equalizer and a “concert” mode that enhances the stereo. These features are controlled from the phone app, but reside in the headphones. However, those features, from what I can tell, work on the bluetooth stream and not over the wire. So, while the wired sound is excellent, you can hear a difference. You have to apply those effects, if you want them, using the source system or software (your music player or stereo). Example: if you use the EQ in the software, I believe it only affects the signal over bluetooth. If you want an EQ over the wire, then the audio source software or device has to have one, and you’d use that.

    To me this is no problem at all.

    Just to be sure, I did this test:

    On Brandi Carlile’s Live at Benaroya Hall, track Turpentine is one I know really well, because I was there at Benaroya Hall when it was recorded. Amazing show. If you listen really carefully you can hear me singing in the audience on the recording (ok, actually not). Anyway, I tried switching back and forth between bluetooth and wired connections during that track to see what the differences were, and if the wired connection was worse in any way, and I’m here to say the wire sounds just fine. If that’s holding you back from a Zik purchase, go for it. Non-issue.

    Made for Planes

    I was itching to try these on an airplane, and this week a work trip granted me my wish. While they are good at home, they are amazing in an airplane. The noise cancellation cut all the plane noise down to a faint hiss, while other people’s voices in the cabin still came through. For my flights I rented two movies to watch through iTunes going and coming back. I tried for something with musical interest, but that I hadn’t seen:

    Going: Pitch Perfect

    Coming back: Searching for Sugar Man

    Pitch Perfect was a little light, but entertaining, and it has a fun sound track. I really liked the lead character Beca (Anna Kendrick), who is super cute.

    Searching for Sugar Man was just amazing – great film.

    The Zik was (were?) incredible on the trip. Comfortable, great sound quality. I charged them before each of the flights, and a four-hour flight used about 40% of their battery.

    Down sides?

    Overall I am really happy with these. They could use a hard case, and that thing with the effort required to plug things in when the unit was new is a quibble. Overall, though, Le Zik is just full to overflowing with L’Awesome.

  • Visualizing Data File Layout III

    This is part three of a blog series illustrating a method to render the file structure of a SQL Server database into a graphic visualization.

    Previous Installments:

    Part 1

    Part 2

    Those that have been reading this series might be thinking, “Is he going to go there?” Well, the answer is “Yes.” This is the GUID clustered index post that had to be. It’s inevitable with this tool.

    If you follow SQL Server at all, you are probably aware of the long-standing debate about whether it is wise, desirable, smart, useful, or what have you, to identify rows using GUIDs. I won’t take a position on that, but I will show here, I hope objectively, a few things that the visualizer shows about file layout vs. distributed inserts, distributed inserts being one of the main challenges around using GUIDs as clustering keys. Just to recap the argument very, very briefly:

    Advantages

    GUID keys can be generated at the client, which saves a round-trip to the database server to create a collection of related rows.

    GUID keys can make certain architectures like sharding, or peer to peer replication, or merging multiple source databases, simpler.

    Disadvantages

    GUID keys are wider, therefore they take more space in memory and on disk. The additional space is multiplied by their presence in both clustered and non-clustered indexes if they are a clustering key.

    GUID keys don’t only take more space in RAM and on disk because of their width. They also cause distributed inserts into the clustered index – that is, new rows are added to any and all pages in the index. Each time a row has to be added, the target page must be read into memory, and at a checkpoint, the whole changed page (both existing and new rows) must be written to disk. This has two effects:

    1. The amount of RAM and disk IO required for inserts is probably much higher, as pages with existing data must come into cache, get changed, and then be written back out again. Essentially, large parts of the table have to be rewritten to disk to append rows to pages that have data already.
    2. The pages that store the index will individually fill up, and have to split such that half the existing rows are written back out to the “old” page and half written out to a “new” page in a different location on disk. This causes the pages to be less full, the same number of rows to require more space on disk and in RAM, and the resulting index to be massively fragmented on disk.

    I am not writing to argue these points, which I think have been established by both sides of the debate, only to see if the visualizer shows these effects clearly. Most of the argument isn’t actually about these facts (they are all true, as far as I know) but rather about which of them matter most, and I think that is the real source of disagreement on the issue.

    Visual Example of Distributed Inserts

    It’s very easy to create an example of this with a small sample database. I created one called “VizDemo2.” VizDemo2 has a slightly modified structure to illustrate what’s going on here – I need two tables that are stored separately on disk, so that they cannot interfere with one another. The simplest way to do that is with a couple of file groups containing one file each. So here’s the structure (a setup sketch follows the list):

    1. I created the database with a 50MB, single file, Primary file group
    2. I added a file group FG1 with one 75MB file
    3. I added a second file group FG2 with one 75MB file
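
    For reference, here’s a minimal sketch of that setup. The file paths and log size below are placeholders I’m assuming for illustration, not values from the original demo:

    -- Sketch of the VizDemo2 layout described above (paths and log size are placeholders)
    CREATE DATABASE VizDemo2
    ON PRIMARY
        ( NAME = VizDemo2_data, FILENAME = 'C:\SQLData\VizDemo2.mdf',     SIZE = 50MB ),
    FILEGROUP FG1
        ( NAME = VizDemo2_FG1,  FILENAME = 'C:\SQLData\VizDemo2_FG1.ndf', SIZE = 75MB ),
    FILEGROUP FG2
        ( NAME = VizDemo2_FG2,  FILENAME = 'C:\SQLData\VizDemo2_FG2.ndf', SIZE = 75MB )
    LOG ON
        ( NAME = VizDemo2_log,  FILENAME = 'C:\SQLData\VizDemo2_log.ldf', SIZE = 25MB );
    GO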

    When the database is empty, the visualizer shows only the system pages at the start of each file, as shown here:

    VizDemo2_01

    To that database I added two sample tables identical in structure but with different clustering keys:

    USE VizDemo2
    GO
    
    CREATE TABLE dbo.SampleCustomersInt  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON [FG1];
    GO
    CREATE TABLE dbo.SampleCustomersGuid  ( 
        id uniqueidentifier NOT NULL PRIMARY KEY CLUSTERED DEFAULT NEWID(), 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON [FG2];
    GO

    I’ll populate the two tables and we can see what the file layout looks like afterward:

    INSERT dbo.SampleCustomersInt DEFAULT VALUES;
    INSERT dbo.SampleCustomersGuid DEFAULT VALUES;
    GO 20000

    Compare

    After inserts, the resulting graphic does show some facts we know to be true:

    VizDemo2_02

    First, the data in the integer-clustered index takes about eight bands of the diagram, while storing the same data in a GUID clustered index has required about twelve bands of data pages. The database itself supports that impression with space allocation – it reports these figures:

    VizDemo2_03

    Part of the extra space required is the width of the key, but part of it is the empty space on each page resulting from page splits. If a page that needs a new row is too full, then half the rows from that page are moved to a net-new page, half left in place, and the new row added to one or the other of the resulting pages. Afterward, they are often both partly empty.

    Second, the whole graphic in the GUID clustered index area is the dark blue that the visualizer uses to show fragmentation – in fact, the object is almost perfectly fragmented, with practically no contiguous pages at all. The sequence of pages in the leaf level of the index is still a linked list, as always, but it is physically stored in essentially random order on disk.
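
    You don’t need the visualizer to confirm this, by the way. A standard DMV query along these lines (run in the VizDemo2 context; SAMPLED mode so the page density column is populated) reports fragmentation and page fullness for both sample tables:

    -- Compare fragmentation and page density of the int- and GUID-clustered tables
    SELECT  OBJECT_NAME( ips.object_id )        AS table_name,
            ips.index_type_desc,
            ips.page_count,
            ips.avg_fragmentation_in_percent,
            ips.avg_page_space_used_in_percent
    FROM    sys.dm_db_index_physical_stats( DB_ID( 'VizDemo2' ), NULL, NULL, NULL, 'SAMPLED' ) AS ips
    WHERE   ips.object_id IN ( OBJECT_ID( 'dbo.SampleCustomersInt' ), OBJECT_ID( 'dbo.SampleCustomersGuid' ) )
            AND ips.index_level = 0;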

    Does Re-Indexing Help?

    The next question is whether we can combat these problems by doing huge amounts of index maintenance – if we rewrite the GUID index, will that make it take less space, or make it more efficient? The answer is, “well, sort of, temporarily.”

    First, re-indexing will put the table in “GUID” order. Whether that really helps is debatable – it does restore read-ahead for the index, which is otherwise clobbered by the fragmentation, but “GUID” order in itself may or may not do anything for performance. Second, re-indexing will make the pages denser, or less dense, depending on the fill factor applied. For the sake of demonstration, let’s re-index with the default fill factor, because I think that happens a lot out in the world, and it may tell us something:

    ALTER INDEX ALL ON dbo.SampleCustomersGuid REBUILD;

    After re-indexing, this is a view just of the second file group with the GUID clustered table (note that I scrolled down in the display):

    VizDemo2_04

    The arrow shows where the data was moved from the old data pages into a new region of the file. And, sure enough, it’s not fragmented (note the lighter color) and it takes less space in the file.

    That might sound good, but if this is a real database, inserts probably will continue. In the int clustered case, as we know, new data will be appended to the end of the page sequence, but in this case, new data will have to be inserted into most of the existing pages on disk. Those are all full now, and will have to be split 50/50 to create new pages for the new data, both the old and new pages will have to be written out, and the new pages by definition can’t be in index order with the existing pages.

    INSERT dbo.SampleCustomersGuid DEFAULT VALUES;
    GO 20000

    What we get after more rows are added to the table is what a layperson might call a “hot mess:”

    VizDemo2_05

    Here everything is fragmented – back to that dark blue – even the pages we just re-indexed a moment ago, because they all split. The table has more than doubled in size, even though we just doubled the number of rows, because the individual pages contain less data.

    Would an appropriate fill factor be a workaround? In some measure, yes, but it really only mitigates the issue. The write activity on the table, even with a low fill factor, will still be higher, as more existing pages have to be flushed at checkpoints. The pages will still be less dense, and therefore take up more space on disk and in cache. In short – maybe helpful, but no silver bullet.
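
    If you do want to try that mitigation, the rebuild statement just takes a fill factor. Something like this (70 is an arbitrary number for illustration, not a recommendation):

    -- Rebuild leaving roughly 30% free space per leaf page to absorb future random inserts
    ALTER INDEX ALL ON dbo.SampleCustomersGuid REBUILD WITH ( FILLFACTOR = 70 );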

    What about Sequential GUIDs? Here I will venture my opinion. Sequential GUIDs have never made sense to me. They solve one part of this problem – the distributed insert part – but at the expense of the very things GUIDs might be good for, namely not demanding a visit to the database to generate an identifier. If you have to come to the database, you already lost this whole argument. Use an integer and solve the rest of the problem at the same time. I can only see it as a sort of band-aid for existing systems that could not be refactored, but, like a bad SUV that combines the worst properties of a car and a truck, it feels like a really poor compromise to me.
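
    For completeness, here’s roughly what that band-aid looks like – a hypothetical variant of the sample table keyed on NEWSEQUENTIALID(). Note that NEWSEQUENTIALID() is only valid as a column default, which rather underlines the point: the value still has to be produced at the database server.

    -- Hypothetical sequential-GUID version of the sample table (illustration only)
    CREATE TABLE dbo.SampleCustomersSeqGuid  ( 
        id uniqueidentifier NOT NULL PRIMARY KEY CLUSTERED DEFAULT NEWSEQUENTIALID(), 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    );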

    I hope this helps to illustrate some of the physical database design challenges that surround the use of GUID cluster keys. In the next installment I’m planning to demonstrate the interleaving of objects, which is one argument for multiple file groups.

  • Visualizing Data File Layout II

    Part 2 of a blog series visually demonstrating the layout of objects on data pages in SQL Server

    Part 1

    In Part 1 of this series, I introduced a little demo app that renders the layout of pages in SQL Server files by object. Today I’ll put that app through its paces to show, in vivid color (well, teal, anyway) the destructive power of the famous Re-Index Then Shrink anti-pattern for index maintenance.

    This one is very easy to demo, so let’s go!

    First, I created a demo database VizDemo1, with a single 200 MB data file. Into that database I placed a canonical table – highly simplified for this example – clustered on an ever-increasing integer, using identity():

    USE VizDemo1
    GO
    
    CREATE TABLE dbo.SampleCustomers  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    );
    GO
    

    Then we populate that table with some dummy data:

    INSERT dbo.SampleCustomers DEFAULT VALUES;
    
    GO 40000

    And finally, fire up the little visualizer app and process the database:

    VizDemo1_01

    The small color bands at the top left corner of the image are the system tables and such that are in every “empty” database to make it run. The blue/green/teal area is the new table we created and populated with sample data, and the gray area represents empty regions in the file.

    As expected, the table started writing into the first available space, and, because the cluster key is increasing, pages were allocated to the end of the page sequence in order, and we end up with a crazy-perfect, contiguous linked list on disk.

    You can see small darker bars at intervals within the table – most of the pages in the index are “type 1” pages, which are the leaf-level/row pages of the clustered index. Those bars are “type 2” index pages that hold the upper level(s) of the index. The reason they are darker is that they are a disruption in the leaf-level linked list, and the app shades such disruptions as a way to see fragmentation. The list has to “hop over” those pages and then continue on the other side. It’s technically fragmentation, but at this point not harmful at all – just remember that the darker color marks a break in the page order.

    A side note: in the midst of the gray area you can see one orange line (and another in the sea of teal). Those are “type 11” PFS pages, which occur at a fixed interval (about every 8,088 pages) in every file. I don’t think they ever move – they track allocation and free space metadata. They are like rocks in the stream…
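
    If you want to poke at the same page-level detail without the app, and you’re on SQL Server 2012 or later, the undocumented sys.dm_db_database_page_allocations function returns a row per page, including its type and its previous/next links. Being undocumented it could change without notice, so treat it as a test-system toy; this is a sketch of how I’d query it for the sample table:

    -- Run in the VizDemo1 database context
    -- List pages allocated to dbo.SampleCustomers, with page type and linked-list neighbors
    SELECT  allocated_page_file_id      AS file_id,
            allocated_page_page_id      AS page_id,
            page_type_desc,
            previous_page_page_id,
            next_page_page_id
    FROM    sys.dm_db_database_page_allocations(
                DB_ID( 'VizDemo1' ), OBJECT_ID( 'dbo.SampleCustomers' ), 1, NULL, 'DETAILED' )
    WHERE   is_allocated = 1
    ORDER BY allocated_page_page_id;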

    Now, what happens if we re-index this bad boy? Well, a re-index operation has to write all the pages for the object into new, blank pages in the file, and then abandon the old pages. I run:

    -- This "moves" all the data toward the end of the file, into free areas
    ALTER INDEX ALL ON dbo.SampleCustomers REBUILD;
    GO

    Then re-analyze the file. As expected, the table has “moved” toward the end of the file, and left free space toward the beginning. It’s still not fragmented, because we had enough room, and it was written in order into that new area by the rebuild:

    VizDemo1_02

    We can see the gray area near the top is all the “abandoned” pages where the index was, and the data has all moved down into the free area. Ah, but that seems wasteful to some people, am I right? All that empty space – the file could be smaller!

    Let’s see the damage that Shrink File does. Imagine that I do this:

    DBCC SHRINKFILE (N'VizDemo1' , 70)
    GO
    

    First, before we shrink, let’s just scroll down and look at the end of the file:

    VizDemo1_03

    We have two conditions – the gray part of the file is sort of OK to shrink. There’s just a lone PFS page out there, and removing that does no harm. But once we get into that blue area, the data has to be moved back up into the beginning of the file. Here’s where the problem lies, as I learned from Mr. Paul Randal – the shrink routine will move a page at a time back into that free space, starting from the end, going backward. That makes the pages land in approximately reverse order from the correct index order. Perfect fragmentation. Let’s see if this tool proves him right. Shrink, then re-analyze:

    VizDemo1_04

    Yep – it’s not immediately apparent, perhaps, but that teal is the darker shade indicating that, through most of the index, every page is a fragment boundary. Perfect fragmentation! Here’s a better view:

    VizDemo1_04a

    So, how can we clean that up? Well, with a rebuild. But … we need that bigger file. In fact, practically any database in production needs this overhead of available space to be able to perform index maintenance. It’s not “wasted” space at all.

    ALTER INDEX ALL ON dbo.SampleCustomers REBUILD;
    GO

    After the rebuild, the index is back toward the end of the file, but it’s also back in order:

    VizDemo1_05

    Now, in light of this information, imagine nightly re-indexing on a database with … AutoShrink! <shudder>
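
    (If you’re not sure whether that horror show applies to any database on a server you manage, sys.databases will tell you:)

    -- Find databases with AutoShrink enabled
    SELECT name
    FROM   sys.databases
    WHERE  is_auto_shrink_on = 1;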

  • Visualizing Data File Layout I

    Part 1 of a blog series visually demonstrating the layout of objects on data pages in SQL Server

    Some years ago a gentleman called Danny Gould created a free tool called Internals Viewer for SQL Server. I’m a visual sort of guy, and I always thought it would be fun and educational to make a simple visualizer, like the one he created, in order to view how objects are laid out in SQL Server files, and to use it to demonstrate how operations like re-index and shrink affect the layout of files.

    To that end, and reinventing the wheel a little bit, truth be told, I spent this past holiday creating a simple .NET app that renders the file layout of a database into a color-coded bitmap:

    FileLayoutViewer01

    Fig 1

    The app can scan the pages in a database, grab the header output from DBCC PAGE, parse that, and create a structure with a few key bits of information about every page. It then renders a bitmap from those structures showing a few things (Fig 1):

    Each data object (index or table) partition is identified with a unique partition ID in SQL Server. Those IDs are used in this tool to color-code the output by object, from a color lookup table. Each color in the example screenshot represents the pages that are dedicated to a single partition of an object. This screenshot shows AdventureWorks, which doesn’t use the Enterprise Edition partitioning feature, so for this case each color represents one object – every object having exactly one partition in Standard Edition (or in databases that don’t use partitioning).

    Unallocated pages are shown as gray gaps. These are regions that are part of the physical file(s), but not used to store anything.

    The app flags pages at the end of any fragment of an object using a darker colored band, so it will reveal any non-contiguous structures in the data file(s). Sometimes these happen at the end of a region of the file where one object is stored, but, interestingly, sometimes these can happen in the middle – as shown in the image above where a dark band interrupts a continuous region of the same color.

    The app has some very basic mouse-over capability where you can run the mouse over the image and the text fields at right will reveal information about the pages, including the object schema.table.index and partition, and also whether the page represents a fragmentation boundary.

    Finally, the app shows what page types are located where in the file using the narrower white/gray/black bands. White represents data or index pages, while other shades of gray or black indicate other kinds of system pages, per Paul Randal’s excellent blog post here.
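
    Under the hood, the raw material for all of this is just page headers plus the system catalog. For example, a page header can be retrieved as a rowset with DBCC PAGE ... WITH TABLERESULTS, and the partition ID reported in the header can be joined back to sys.partitions to identify the owning object and index. The page address and partition ID below are made-up examples, just to show the shape of it:

    -- Dump only the header (printopt 0) of page 1:100 as a rowset; no trace flag needed with TABLERESULTS
    DBCC PAGE ( 'AdventureWorks', 1, 100, 0 ) WITH TABLERESULTS;

    -- Map a partition ID from a page header back to its object and index
    SELECT  OBJECT_SCHEMA_NAME( p.object_id ) AS [schema],
            OBJECT_NAME( p.object_id )        AS [table],
            i.name                            AS [index],
            p.partition_number
    FROM    sys.partitions AS p
    JOIN    sys.indexes    AS i
            ON  i.object_id = p.object_id
            AND i.index_id  = p.index_id
    WHERE   p.partition_id = 72057594043236352;  -- placeholder value; use the one from your page header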

    The Pixels Already Tell a Story

    So, what can we learn about the sample database in this image? Here are a few things:

    1. The part of the file shown in the bitmap is fairly dense. There aren’t big regions of unallocated space in the file. A gap in the allocated pages looks like this (enlarged):
      FileLayoutViewer02
      Empty Region
       
    2. Objects in the file are not contiguous, and may “hop around.” That is, if you follow the linked list of pages that compose an index, a bunch of them will be in a row, and then there will be a page that links to the next page composing the index but it’ll be in a different location in the file. I’ve called these “frag boundaries” – pages that do link to another page, but where that next logical page isn’t the next physical page in the file. In the graphic the frag boundary pages are colored with a darker dithered pattern. You can mouse over these and look in the text fields at the right in the app, and see the page they link to.
      FileLayoutViewer03
      Fragment Boundaries
       

      Sometimes the end of a fragment will be adjacent to pages from another object, but it can be the case that there’s a fragment boundary in the middle of the pages for one object – it’s just that the linked list goes up to that point in the file, but then the next page in the index (in index order) isn’t the next page in the file, even though the next page in the file is part of the same object. Imagine a page split in the “middle” of an index – the existing page with half the rows stays in place, and a new page with the other half of the rows gets created in the middle of the logical index order but possibly stored in some other location in the physical file.
    3. Right at the very beginning of the file there’s a special sequence of metadata pages that describe the database, allocations, and so on (again, well documented by Paul Randal). In our diagram this shows up as a series of pages at top left with varying page type (the gray and white) indicators:
      FileLayoutViewer04
      Database and file metadata pages

    In the next installment, I’ll run some test databases through this and we can see what more severe fragmentation looks like, the effect of GUID cluster keys, shrink, and how the data moves around in a re-index operation.

    Here’s a short demo video of the mouse-over in action (quality is YouTube-limited):

    Animated Screen Cap of Mouse-over
