
Merrill Aldrich

Technical content about Microsoft data technologies. All opinions expressed are purely my own and do not reflect positions of my employer or associates.

  • Stats on Stats

    I have been working on new features for the Rules-Driven Maintenance Solution, including per-index maintenance preferences and more selective processing for statistics. SQL Server stats is a topic I knew about at a high level, but lately I have had to take a deeper dive into the DMVs and system views and wrap my head around things like sys.dm_db_stats_properties() and sysindexes.rowmodctr.

    Many people have published great stuff on this, and I am indebted to Erin Stellato (both before and after she joined SQLskills), as well as Kendra Little, Grant Fritchey, Ted Krueger, Microsoft, and others. So this post is no earth-shaking new discovery; these are just some notes from my studies.

    First, if you want to move beyond SQL Server’s auto update statistics and a nightly run of sp_updatestats, it’s important to understand where things stand with the statistics objects in the database. I wanted statistics metadata queries that:

    1. Show the age of the statistics values
    2. Show how heavily the underlying table or index has been modified since a statistics update
    3. Show the size of the indexes
    4. Allow me to create a formula relating modifications to size
    5. Work with table partitioning
    6. Work with multiple schemas

    Most of these are straightforward except for the “how heavily has the table been modified” bit. As these other authors point out, there are two places to find that information: in older versions, the best option is the rowmodctr column in the legacy sysindexes view, and in newer versions the management function sys.dm_db_stats_properties(). The latter only arrived post-RTM, via service packs (SQL Server 2008 R2 SP2 and SQL Server 2012 SP1).

    So, there are really two queries – no way around it:

    Update 21 Sept 2013: I found some improvements for these queries and updated them in place.

    /* Legacy sysindexes example */
    
    SELECT  statsSummary.schemaName,
            statsSummary.tableName,
            statsSummary.statsName,
            statsSummary.object_id,
            statsSummary.stats_id,
            statsSummary.statsUpdate,
            statsSummary.statsType,
            statsSummary.indexName,
            statsSummary.index_id,
            statsSummary.modCounter,
            rowCounts.row_count
    FROM (
            SELECT  sc.name schemaName,
                    o.name tableName,
                    s.name statsName,
                    s.object_id,
                    s.stats_id,
                    STATS_DATE(o.object_id, s.stats_id) statsUpdate,
                    CASE WHEN i.name IS NULL THEN 'COLUMN' ELSE 'INDEX' END AS statsType,
                    ISNULL( i.name, ui.name ) AS indexName,
                    ISNULL( i.index_id, ui.index_id ) AS index_id, 
                    si.rowmodctr AS modCounter
            FROM sys.stats s
                INNER JOIN sys.objects o ON s.object_id = o.object_id
                INNER JOIN sys.schemas sc ON o.schema_id = sc.schema_id
                LEFT JOIN sys.indexes i ON s.object_id = i.object_id AND s.stats_id = i.index_id
    
                -- Getting rowmodctr from sysindexes for legacy versions
                LEFT JOIN sys.sysindexes si ON s.object_id = si.id AND s.stats_id = si.indid
    
                -- If the statistics object is not on an index, get the underlying table
                LEFT JOIN sys.indexes ui ON s.object_id = ui.object_id AND ui.index_id IN ( 0, 1 )
        ) AS statsSummary
        INNER JOIN (
            SELECT object_id, index_id, SUM(row_count) row_count
            FROM sys.dm_db_partition_stats
            GROUP BY object_id, index_id
            HAVING SUM( row_count ) > 0
        ) AS rowCounts ON statsSummary.object_id = rowCounts.object_id 
            AND statsSummary.index_id = rowCounts.index_id
    
    /* 2008 R2 SP2 and higher example */
    
    SELECT  statsSummary.schemaName,
            statsSummary.tableName,
            statsSummary.statsName,
            statsSummary.object_id,
            statsSummary.stats_id,
            statsSummary.statsUpdate,
            statsSummary.statsType,
            statsSummary.indexName,
            statsSummary.index_id,
            statsSummary.modCounter,
            rowCounts.row_count 
    FROM (
            SELECT  sc.name schemaName,
                    o.name tableName,
                    s.name statsName,
                    s.object_id,
                    s.stats_id,
                    STATS_DATE(o.object_id, s.stats_id) statsUpdate,
                    CASE WHEN i.name IS NULL THEN 'COLUMN' ELSE 'INDEX' END AS statsType,
                    ISNULL( i.name, ui.name ) AS indexName,
                    ISNULL( i.index_id, ui.index_id ) AS index_id, 
                    sp.modification_counter AS modCounter
            FROM sys.stats s
                INNER JOIN sys.objects o ON s.object_id = o.object_id
                INNER JOIN sys.schemas sc ON o.schema_id = sc.schema_id
                LEFT JOIN sys.indexes i ON s.object_id = i.object_id AND s.stats_id = i.index_id
    
                -- If the statistics object is not on an index, get the underlying table
                LEFT JOIN sys.indexes ui ON s.object_id = ui.object_id AND ui.index_id IN ( 0, 1 )
    
                CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
        ) AS statsSummary
        INNER JOIN (
            SELECT object_id, index_id, SUM(row_count) row_count
            FROM sys.dm_db_partition_stats
            GROUP BY object_id, index_id
            HAVING SUM( row_count ) > 0
        ) AS rowCounts ON statsSummary.object_id = rowCounts.object_id 
            AND statsSummary.index_id = rowCounts.index_id
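
    If you want a single script to pick the right query automatically, one option is to check for the presence of the management function first. Here’s a minimal sketch of that idea, with PRINT simply standing in for whichever query you would actually run:

    /* Check whether this instance has sys.dm_db_stats_properties() */
    IF EXISTS ( SELECT 1 FROM sys.all_objects WHERE name = N'dm_db_stats_properties' )
        PRINT 'Use the sys.dm_db_stats_properties() query' ;  -- 2008 R2 SP2 / 2012 SP1 and later
    ELSE
        PRINT 'Use the legacy sysindexes query' ;             -- older builds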

    Using one or the other of these, it’s possible to make a list of statistics objects with a comparison of modifications to the underlying table, row count and so on, and then build update statements to act accordingly. This can be useful for large tables where reaching the auto update threshold would take too long, or to prevent updating statistics on tables with only a few modifications, where sp_updatestats would update the statistics anyway, possibly with a lower sample rate than one would like. The classic example is rebuilding an index, which gives you stats with perfect sampling, and then replacing those perfect stats with worse ones by updating statistics with default sampling.
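
    As an illustration of that kind of statement building, here is a minimal sketch against the newer DMV. The 10 percent threshold and the FULLSCAN option are arbitrary values chosen for the example, not a recommendation:

    /* Sketch: generate UPDATE STATISTICS commands where modifications exceed 10% of rows */
    SELECT  'UPDATE STATISTICS '
            + QUOTENAME(sc.name) + '.' + QUOTENAME(o.name)
            + ' ' + QUOTENAME(s.name) + ' WITH FULLSCAN;' AS updateCommand
    FROM sys.stats s
        INNER JOIN sys.objects o ON s.object_id = o.object_id
        INNER JOIN sys.schemas sc ON o.schema_id = sc.schema_id
        CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
    WHERE o.is_ms_shipped = 0
        AND sp.rows > 0
        AND sp.modification_counter > 0.1 * sp.rows ;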

    The Problem with Rowmodctr

    Ted Krueger has a really nice demo here, showing some cases when auto update will happen. Ted does a great job explaining the 500-plus-20-percent rule that SQL Server seems to use for auto update (note, though, that the Extended Events code in the post only works on SQL Server 2012 or later). But there is one gotcha in the way the server counts modifications to a table.

    Ted doesn’t do this, but in other contexts it’s easy to throw around the 500-plus-20-percent “of the table is modified” rule without looking closely at what that means. For example, does the rule mean the server ensures that 20 percent of the distinct rows in the table (a number of rows equal to 20 percent of the total row count) get modified? Probably not. How do repeated updates to the same rows count? What about data in the table or index that isn’t a key, but just an included column?

    Secondly, it’s important to note that, for auto update, the act of updating the statistics is not triggered by the changes to the data themselves. I used to think it was triggered by a select statement against a table whose statistics had been invalidated by modification, but that, as it turns out, is only partly true. Really, the statistics are only updated when the optimizer needs them to build a new query plan. There are cases where a query plan is already cached and the threshold for auto update has been exceeded, but the statistics are not updated, because the optimizer has not requested fresh statistics in order to optimize a new query.

    The developers of SQL Server obviously have more access to internal data and counters than we do. Reading between the lines on this statistics-update-versus-table-modification DMV issue, I get the sense they are a little circumspect about fully exposing and documenting all of it. So, we have what we have: STATS_DATE(), sys.dm_db_stats_properties().modification_counter and sys.sysindexes.rowmodctr.

    Here’s a demo showing some of the nuances of this, and how the rowmodctr or modification_counter act in perhaps counterintuitive ways.

    First, set up a test database with views that contain slightly tweaked versions of the stats metadata queries, so we can call them easily, and a test table and indexes:

    CREATE DATABASE StatsDemo;
    GO
    
    USE StatsDemo
    GO
    
    CREATE VIEW dbo.stats_metadata AS
    
    SELECT  statsSummary.schemaName,
            statsSummary.tableName,
            statsSummary.statsName,
            statsSummary.object_id,
            statsSummary.stats_id,
            statsSummary.statsUpdate,
            statsSummary.statsType,
            statsSummary.indexName,
            statsSummary.index_id,
            statsSummary.modCounter,
            rowCounts.row_count 
    FROM (
            SELECT  sc.name schemaName,
                    o.name tableName,
                    s.name statsName,
                    s.object_id,
                    s.stats_id,
                    STATS_DATE(o.object_id, s.stats_id) statsUpdate,
                    CASE WHEN i.index_id IS NULL THEN 'COLUMN' ELSE 'INDEX' END AS statsType,
                    ISNULL( i.name, ui.name ) AS indexName,
                    ISNULL( i.index_id, ui.index_id ) AS index_id,
                    sp.modification_counter AS modCounter
            FROM sys.stats s
                INNER JOIN sys.objects o ON s.object_id = o.object_id
                INNER JOIN sys.schemas sc ON o.schema_id = sc.schema_id
    
                -- If a statistics object is on an index, get that index:
                LEFT JOIN sys.indexes i ON s.object_id = i.object_id AND s.stats_id = i.index_id
    
                -- If the statistics objects is not on an index, get the underlying table:
                LEFT JOIN sys.indexes ui ON s.object_id = ui.object_id AND ui.index_id IN ( 0, 1 )
                OUTER APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
        ) AS statsSummary
        INNER JOIN (
            SELECT object_id, index_id, SUM(row_count) row_count
            FROM sys.dm_db_partition_stats
            GROUP BY object_id, index_id
            HAVING SUM( row_count ) > 0
        ) AS rowCounts ON statsSummary.object_id = rowCounts.object_id 
            AND statsSummary.index_id = rowCounts.index_id
    where schemaName = 'dbo'
    
    GO
    
    CREATE VIEW dbo.stats_metadata_legacy AS
    
    SELECT  statsSummary.schemaName,
            statsSummary.tableName,
            statsSummary.statsName,
            statsSummary.object_id,
            statsSummary.stats_id,
            statsSummary.statsUpdate,
            statsSummary.statsType,
            statsSummary.indexName,
            statsSummary.index_id,
            statsSummary.modCounter,
            rowCounts.row_count
    FROM (
            SELECT  sc.name schemaName,
                    o.name tableName,
                    s.name statsName,
                    s.object_id,
                    s.stats_id,
                    STATS_DATE(o.object_id, s.stats_id) statsUpdate,
                    CASE WHEN i.index_id IS NULL THEN 'COLUMN' ELSE 'INDEX' END AS statsType,
                    ISNULL( i.name, ui.name ) AS indexName,
                    ISNULL( i.index_id, ui.index_id ) AS index_id,
                    ISNULL( si.rowmodctr, usi.rowmodctr ) AS modCounter
            FROM sys.stats s
                INNER JOIN sys.objects o ON s.object_id = o.object_id
                INNER JOIN sys.schemas sc ON o.schema_id = sc.schema_id
    
                -- If a statistics object is on an index, get that index:
                LEFT JOIN sys.indexes i ON s.object_id = i.object_id AND s.stats_id = i.index_id
                LEFT JOIN sys.sysindexes si ON i.object_id = si.id AND i.index_id = si.indid
    
                -- If the statistics objects is not on an index, get the underlying table:
                LEFT JOIN sys.indexes ui ON s.object_id = ui.object_id AND ui.index_id IN ( 0, 1 )
                LEFT JOIN sys.sysindexes usi ON ui.object_id = usi.id AND ui.index_id = usi.indid
        ) AS statsSummary
        INNER JOIN (
            SELECT object_id, index_id, SUM(row_count) row_count
            FROM sys.dm_db_partition_stats
            GROUP BY object_id, index_id
            HAVING SUM( row_count ) > 0
        ) AS rowCounts ON statsSummary.object_id = rowCounts.object_id 
            AND statsSummary.index_id = rowCounts.index_id
    WHERE schemaName = 'dbo'
    GO
    
    CREATE TABLE dbo.SomeData(
        id int IDENTITY(1,1) NOT NULL,
        val1 nvarchar(50) NULL,
        val2 nvarchar(50) NULL,
        val3 nvarchar(50) NULL,
     CONSTRAINT PK_SomeData PRIMARY KEY CLUSTERED  (
            id ASC
        )
    );
    
    CREATE NONCLUSTERED INDEX IX_Val1 ON dbo.SomeData (
        val1 ASC
    );
    GO

    Next, check the statistics views. They should be empty, because our test table starts with no rows, and those views filter out zero-row tables:

    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy

    Now we will put a row in the test table. This will invalidate the statistics, and then when the optimizer needs to optimize a query that uses the key columns in the indexes – not when the row is inserted – the stats will auto-update. Look closely at the results at each stage, and the timestamps on the stats update column. Also note that the placement of GO is important, because we need the text of the select statements/batches to be identical later in the demo:

    -- From zero to more than zero rows
    
    INSERT INTO dbo.SomeData (
        val1,
        val2,
        val3
    ) VALUES (
        'A', 
        'B', 
        'C' 
    )
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO
    
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO

    I get this, which is expected:

    [Screenshot: StatsUpdateResults]

    The mod counter and row_count columns make sense, and the statsUpdate dates support the idea that a stat for an index is updated when the optimizer needs to use it for a query and the data in the table has been changed.

    Now things get less intuitive, though.

    -- From 1 to 20 rows
    
    INSERT INTO dbo.SomeData (
        val1,
        val2,
        val3
    ) VALUES (
        'A', 
        'B', 
        'C' 
    )
    GO 19
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO
    

    So far, so good. 20 rows does not trigger an update. At this point the mod counters show ‘19’ changes, which makes sense.

    Now, what happens if we do a lot of updates, but just to a single row:

    UPDATE dbo.SomeData SET val1 = 'Z' WHERE id = 1
    GO 1000
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO

    [Screenshot: StatsUpdateResults2]

    OK, now wait a second. We updated the table 1000 times (more than 500 + 20 percent). We updated a column that is in both the nonclustered index and also in the clustered index. Why is the mod counter 19 for the clustered index and 1019 for the non? And how is the mod counter so much higher than the actual number of rows in the table? And why no statistics update? The date stamp did not change. Lots of questions.

    Consider: the updated column is a key in the nonclustered index but just a plain old column in the clustered index. So it appears, and this makes sense, that modifications are tracked to the keys related to statistics but not other columns. So the mod counters update when we change a key value of a specific index, and just for that index.

    Next, consider whether the select query is being optimized at this point, or is in cache from before. Hmm. More on this later.

    Lastly, I want to point out that the mod counter has been bumped 1000 times, even though we only ever updated a single row, and didn’t actually change the value for 999 of those updates.

    So: the modification counter is not sensitive to which rows in a table are changed. It’s not “20 percent of the rows” in that rule of thumb. What it really seems to mean is that each time an update, insert or delete is performed, the counter is incremented by the number of rows affected by that operation. It doesn’t mean they were different rows, or even different values. This makes sense, because actually tracking how many distinct rows changed, with before and after values, would carry a lot of overhead. I believe what happens is roughly, for every query, “take the @@ROWCOUNT where the query changed a key column in the index, and add that to the modification counter.”

    If you need more proof, here’s a query that does change a value on each iteration, but for the same row:

    UPDATE dbo.SomeData 
    SET val1 = CASE val1 WHEN 'A' THEN 'Z' ELSE 'A' END 
    WHERE id = 1
    GO 1000
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO

    Still no stats update. Same pattern in the stats metadata.

    How about, then, changing more of the rows?

    UPDATE dbo.SomeData 
    SET val1 = CASE val1 WHEN 'A' THEN 'Z' ELSE 'A' END 
    --where id = 1
    GO 1000
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO

    Now this is really odd. My statistics metadata shows no update to the stats, and 22,019 modifications against the nonclustered index on this 20-row table, but still only 19 changes for the clustered index. Huh?

    When you think about it, though, it does make sense. The mod counter is just incremented by the number of affected rows for keys in the index. If a key is not affected, the counter doesn’t increment. Therefore the statistics for the clustered index are still valid – no keys have changed. There have been no changes to those keys since we populated the table, even though we have changed other data. Statistics track key columns, so changes to non-key columns are not relevant to them.

    Now let’s bulk up the table, past those milestones of 500 rows and so on:

    -- From 20 to 499 rows
    
    INSERT INTO dbo.SomeData (
        val1,
        val2,
        val3
    ) VALUES (
        'A', 
        'B', 
        'C' 
    )
    GO 479
    
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO
    
    -- From 499 to 500 rows
    
    INSERT INTO dbo.SomeData (
        val1,
        val2,
        val3
    ) VALUES (
        'A', 
        'B', 
        'C' 
    )
    GO 1
    
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO
    
    -- From 500 to 5000 rows
    
    INSERT INTO dbo.SomeData (
        val1,
        val2,
        val3
    ) VALUES (
        'A', 
        'B', 
        'C' 
    )
    GO 4500
    
    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO
    
    

    [Screenshot: StatsUpdateResults3]

    Would it surprise you to note that there still has been no update to statistics? Or that the mod counter now shows 26,999 changes to this 5,000 row table?

    The answer? Well, I took a look in the plan cache:

    SELECT UseCounts, Cacheobjtype, Objtype, text, query_plan
    FROM sys.dm_exec_cached_plans 
        CROSS APPLY sys.dm_exec_sql_text(plan_handle)
        CROSS APPLY sys.dm_exec_query_plan(plan_handle)
    WHERE text like '%SELECT%FROM%SomeData%'

    In my run of this demo, I have 9 (!) versions of the two test select statements above in the cache; some are ad hoc, some are auto-parameterized. But they are already optimized.

    If I do this (only in test, please!):

    DBCC FREEPROCCACHE

    And then I do this:

    SELECT val1 FROM dbo.SomeData WHERE val1 = 'A'
    GO
    SELECT val1 FROM dbo.SomeData WHERE id = 1
    GO
    
    SELECT * FROM dbo.stats_metadata
    SELECT * FROM dbo.stats_metadata_legacy
    GO

    I do finally get a statistics update, and the mod counters reset to 0:

    [Screenshot: StatsUpdateResults4]

    (This does not mean you should free the proc cache on your server; this is just a demo.)
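
    If the goal is just to force reoptimization of queries that touch one table, a gentler option than flushing the whole cache would be something like the following, which marks cached plans referencing the demo table for recompilation. This isn’t part of the original demo, just an alternative:

    EXEC sp_recompile N'dbo.SomeData' ;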

    So, what are the takeaways here?

    1. Rowmodctr or modification_counter does not track the “percentage of the table that has been modified,” a common misconception. I think they really track the “number of times a DML operation affected key values in the index.”
    2. The counters know nothing about whether the modifications were to one row repeatedly or to many rows. They are not perfect – and they can’t tell you, I don’t think, “how much” of the table is changed.
    3. Auto update stats is triggered by query optimization. That is, when the optimizer takes a “new” query and the statistics it would use have been invalidated, it requests fresh statistics (assuming auto update stats is on), which are collected just before that query is optimized. On a real, running system this probably happens very quickly, but as this demo shows, there are cases where a plan sits in cache and no optimization happens to make the stats update.

    This was very educational for me, and it is shaping how I plan to update the rules-based maintenance from “any row changed” for stats to something, hopefully, more accurate. The mod counter values are so coarse, though, that I’m finding it can’t be an exact science.
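
    As a rough sketch of the direction, the metadata view from the demo makes it easy to compare the mod counter against the classic 500-plus-20-percent threshold. The labels below are just illustrative; the point is the comparison:

    /* Sketch: flag statistics whose modifications exceed the classic auto-update threshold */
    SELECT  schemaName,
            tableName,
            statsName,
            modCounter,
            row_count,
            CASE WHEN modCounter >= 500 + ( 0.20 * row_count )
                 THEN 'candidate for update'
                 ELSE 'leave alone' END AS assessment
    FROM dbo.stats_metadata ;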

  • Small-Footprint Laptop Lab

    Champagne Dev VMs on a Beer Budget

    Others have posted on this very topic, but I rebuilt my home/dev VM lab this past weekend, and it went so well I thought I’d share the steps I went through, in case it helps someone else get started. I got my start doing this a while back with a very helpful post by Jonathan Kehayias on VirtualBox, but I have evolved my home lab since.

    As I see it, today you have two choices for this sort of work:

    1. A cloud service like Azure or Amazon, etc.
    2. A small, local VM farm on your laptop with your own storage and hypervisor.

    There are advantages and disadvantages to each, obviously. Using the cloud you can have zero footprint on your own hardware, and not worry too much about disks and so on. On the other hand, you are tied to an internet connection, which can cramp your style when doing presentations.

    The cloud frees you from a lot of setup tasks. Setup can be, depending on your perspective, dull work you need to plow through or a learning experience. I personally still like going through the process because I learn so much each time. Building servers and performing installation is still part of my job, at least for the foreseeable future. With the cloud you typically click through a selection from a catalog and your VM comes pre-built.

    Just for the experience, I did both in the past few days.

    Azure VM

    Azure is quite straightforward. Here’s a screenshot showing an RDP session to my remote Azure VM, which came with SQL Server 2014 CTP preinstalled:

    [Screenshot: AzureRDP]

    This is a VM with SQL Server installed – not the Azure PaaS SQL Server offering – so I have the whole OS to fool with here. Creating this was super simple. I have the advantage of an MSDN subscription, which provides credits on Azure for dev/test work. I activated that, and then just followed the instructions on the web site to provision a VM from a catalog image of a Windows 2012 / SQL 2014 CTP virtual machine. Literally click New > Virtual Machine > From Gallery and choose from the catalog.

    You end up with the machine, RDP access, and also a nice portal dashboard with stats about the VM:

    [Screenshot: AzureDB]

    A few things commend this approach:

    1. Super easy
    2. Nothing on your machine
    3. Not permanent – the VM can be deleted easily when you are done
    4. This is what the future looks like, truly. It’s useful to learn about it and mitigate what Grant Fritchey has aptly called Buggy Whip Weaving.

    Old-School VMs

    [Image: ExtSSD]

    It’s still feasible, of course, to install VMs locally, and these days it doesn’t cost much. I do this with 16 GB RAM in my laptop and a modest USB 3 external disk enclosure with a Crucial SSD inside. This setup works well for demo and learning VMs. I did have to purchase a USB 3 express card, my laptop lacking built-in USB 3, and the only downside is that the express card itself has proven dodgy*. It does not like fancy stunts like “Sleep” or “Restart” or “Eject” and so on. But I blame the card, not the solution architecture.

    Hosting the VMs requires a hypervisor. In my case I have tried several, but settled on Windows 8’s built-in Hyper-V. It’s free with the OS, all the basics work, and it has good PowerShell support, which I appreciate.

    The main challenges I experienced with Hyper-V:

    1. No easy clipboard support to copy and paste into and out of VMs with the console
    2. It just flat-out will not share my laptop’s wifi with the VMs. In order to connect the VMs to the Internet, which I only do intermittently for things like patching, I have to plug in to a physical network. I believe this is by design, but I don’t exactly understand the reasoning behind it. I spent a fair amount of time trying to fight this one but gave up.

    Steps

    For this generation of my home lab I followed this basic outline:

    1. Download Windows Server 2012 Datacenter Evaluation from Microsoft as a VHD. This is a pre-built virtual hard drive with the OS “installed” on it, ready to complete setup and use.
    2. Download SQL Server 2012 Enterprise Evaluation Edition as an ISO. I went this way because I do want to go through the setup steps.
    3. Extract the delivered VHD download to my external SSD disk and rename it.

    For the setup I wanted, I need a domain controller and three or four SQL Servers. SSDs are fast, but capacity is expensive, so I found I could save a lot of space by using Hyper-V’s differencing disks. This basically allows you to make a “parent” hard drive image file, and then derive multiple virtual hard drives from it, storing only the differences. Other hypervisors also have this feature. By using this technique I am able to store four servers’ hard disks in about 28 GB of space on disk. The trick is the install order, and what will maximize the shared disk space.

    Four VMs’ hard drives with Differencing:

    [Screenshot: HyperVDifferencingDisks]

    I got good results with these steps:

    Make a parent virtual hard drive for the OS

    1. In Hyper-V manager, make a virtual machine, and choose the downloaded Windows 2012 VHD file from Microsoft as the hard drive.
    2. I use two virtual switches in my Hyper-V setup, one for internal communication and one to connect out to the Internet. I set the VM up with a virtual network adapter for each.
    3. Start the VM and go through the setup process. This disk will become the “parent” hard drive in the differencing-disk setup, storing the bulk of the bits for the OS, so I want to embed here the things that all the servers will share. So:
    4. Patch the machine by configuring Windows Update while connected to the Internet.
    5. Install .NET Framework 3.5 in Server Manager > Features while connected to the internet. This is important because SQL Server requires this, but it’s not on the VHD from Microsoft, and feature setup has to download it using Windows Update.
    6. This provides a working server hard drive that will allow us to derive the others without needing as much space as a full copy of the VHD.
    7. Sysprep this server by running sysprep.exe with the “generalize” option, so we can make clones.

    Make a domain controller

    1. The first real machine is the DC. Shut down the “parent” vm above.
    2. In Hyper-V Manager, edit the properties of the virtual machine (or make a new one and specify that you’ll “attach a virtual hard disk later.”)
    3. Set the hard drive to be a new VHD file, using the differencing option. This file will use snapshot technology to retain only the blocks on the disk that are different from the shared parent disk, in a sense de-duplicating most of the hard drive content.
    4. Choose the virtual hard drive from above as the parent for the differencing disk.
    5. Boot the VM and go through the configuration tasks.
    6. The DC will want to have a static IP address. Visit Settings > Control Panel > Network and Internet > Network and Sharing Center > Change adapter settings. Right-click the network adapter and choose properties. Locate Internet Protocol Version 4. Click Properties and set the adapter to a static IP address. To keep things simple I usually use 10.10.10.1, then 10.10.10.2, then 3, 4, as I add virtual machines to the lab.
    7. From Server Manager, add roles and features, add the AD domain controller role, and follow the wizard there to set this up as the DC for the lab. I just use a domain called “arch.lab.”
    8. That gets us a little domain controller.

    Make a parent virtual hard drive for SQL Servers

    1. Create a new VM in Hyper-V Manager, using the same technique: choose the option to “assign a virtual hard drive later,” then add a new virtual hard drive using the Differencing option, and the OS parent hard drive from the first section above.
    2. Boot this new VM, and run through the OS setup.
    3. Install SQL Server on this VM by mounting the SQL Server ISO file as a DVD in the virtual “DVD drive” provided by Hyper-V.
    4. Sysprep this VM, so it, too, can be cloned.
    5. Shut this VM down. This virtual hard drive will become the parent for all the SQL servers, basically de-duplicating the SQL Server product bits, and again saving space.

    Make the actual SQL Servers

    1. Create three or four more virtual machines in Hyper-V Manager, and for each make a new virtual hard drive of the differencing type, with the parent SQL Server VHD from the preceding section as the parent. Each of these machines will act like SQL Server is “preinstalled,” because of the previous steps, and the redundant bits will be stored only once in the parent VHD file.
    2. Start these and run through the configuration steps to set them up.
    3. Set a unique static IP address for these machines like the DC.
    4. Use Server Manager > Local Server > Computer Name to add each machine to the lab domain.

    Conclusion

    This was fun and educational for me, and now I have a fresh lab for talks I am giving this fall. You can, I hope, see the appeal of Azure, though. Even though I made only one Azure VM, it took a few clicks and most of the installation pain is just not there. Making the same four VMs as on my local lab, I can tell, would be fairly simple.

    Update 20 Sept 2013: I found the Express card problem (Huzzah!). It was the defaults for PCI power saving in my laptop BIOS.

  • Education, Money, Happiness

    It’s back-to-school time. The MCM program has been canceled. A good time to reflect.

    I have two little boys, you might know, Will and Robbie. Will is a third grader this year, and Robbie starts Kindergarten tomorrow. I have been through “school selection,” which, especially if you have kids, really means, “Where must we live so our kids have a decent education?” With the littlest one starting school in earnest, and the cancelation of the MCM - and the hand-wringing and worry and blogging about that throughout the SQL community – I have been mulling over this problem of education a lot.

    School and I go Way Back

    I am probably an over-educated nerd, to be honest. I have, taken together, eight years of college education. I don’t have exactly the degree you would expect from all those years, but they were all valuable – the papers I do have are for two Architecture degrees. I studied Architecture for six years, computer science for two. Before that I was in a classic New England prep school, where going to college was just an expected norm. I did an exchange program in Sweden during that time, and lived and went to a college prep program in Linköping (pronounced roughly leen sherping). Both my parents are retired teachers from U.S. public schools, my father a French teacher, my mother, a music teacher.

    I even taught, albeit as a very young visiting lecturer rather than a professor, at a prestigious university for a time. I worked alongside career academics and got a little of the flavor of that life.

    Architecture is difficult, and heavily regulated, and so has a very long and onerous program of university, internship, testing and accreditation, followed by more continuing ed and more accreditation. When I went to school it took at least 8-10 years from high school graduation to become an architect. I never did, though I worked in the field for ten years. Most architects never make much money. Some stars do. The movie image of an architect is just that. Imaginary.

    What I mean to say is, I have spent my whole life in relationship to schools, to teaching, to certifications, to the measure of a framed piece of paper on the wall, and what that means in light of career, money and happiness. It was an expectation from my parents, who, as teachers, are whole-hog invested in education as a basic moral imperative. Elaborate certification was a specter during my whole time designing buildings, one that I resisted. “Certs,” as we call them in IT, are a constant topic in the SQL Server community. Now, as a parent, I have to make up my mind about how this continues with my boys.

    Consider This

    This is a very good series of short videos from T.V. personality Mike Rowe, that I found online via Andy Leonard:

    http://profoundlydisconnected.com/alternative-education/ 

    Also consider this related older thread from Andy, which is thought-provoking:

    http://sqlblog.com/blogs/andy_leonard/archive/2011/06/27/my-name-is-andy-and-i-have-an-associates-of-applied-science-degree.aspx

    Common Myths

    I don’t know much, but after long years dealing with and thinking about these issues, I know for a fact that our society carries some mythology, perpetuated by media and some institutions, that doesn’t help families make good choices about education.

    Myth: College is About Getting a Job. College is Job Training.

    There are some college programs that look very much like job training – engineering, for example, or law or medicine. It might be that a large percentage of graduates from those programs go on to work in their “chosen field.” It might even be that new entrants into some professions have to have a degree in the field. Still, and call me crazy, I am weary of hearing that education is only important to ensure young people have job training, so they can be employed, because that idea is so misguided. We sell our whole culture short by constantly reducing this issue to employment (pun intended). We sell our young people short every time someone makes a wisecrack like, “oh, Art degree – would you like fries with that? Har, har. Yuk, yuk.” Yet this is what we hear, over and over: employment as the only important rationale for education. People sometimes say, “Education can enrich your life,” and that is more true – but they are rarely taken seriously.

    Here’s the truth: if you live in fear of not having essentials like food, heat, a roof over your head, you’ll be miserable. If you live in constant worry that you will lose your job, you’ll be miserable. If you live with a horrible job that you hate, but you have to keep it to survive, you’ll be miserable.

    If you look at those statements, you might note that college is “orthogonal” – an overeducated word for “unrelated” – to any of those things. I have a good job and make a decent living in a field totally unrelated to the professional program I attended in school. It doesn’t mean the education was lost, or didn’t have value, it means the real value of that education isn’t job training.

    Myth: College Will Help You Make a lot of Money.

    Truth: C students and dropouts are often richer than we are, the nerdy A students. Anybody out in the workplace with open eyes should know that. There might be exceptions for some Ivy-league CxOs or surgeons or lawyers, but in my experience education and serious wealth are not directly related. Why is that? This could be a whole other post, but to summarize my belief on this topic:

    • Nobody makes serious, exotic-car-collection, multiple-house money by working for a wage or salary. You make money by controlling wealth-generating assets. This is because other people will not pay you great piles of money for your individual labor, they will only pay you great piles of money because you control assets like companies, labor, material, technology. There is little related to education in that equation. What it takes is the insight and skill to gain control of assets – if that’s your cup of tea.
    • You never make serious money unless the main thing you are paying attention to is making money. If your priorities are elsewhere, other people probably get the money. That’s not necessarily a bad thing, but there it is. There are a few stories of people who, for example, invented some technology and stumbled into riches, but in fact the majority of rich people are rich by focusing single-mindedly on getting rich. University education will provide you no end of distractions from getting rich.
    • Getting and keeping serious money requires being blind to, or stomaching, certain ethical dilemmas. University might well deepen those dilemmas.

    So, if you want great piles of money – and perhaps question that impulse – then don’t focus on higher learning. You need to be literate, and you need to be able to do math and such, but get control of a business instead of worrying about some Master’s degree. It’s more effective to learn business in business than in a university.

    But, I am not rich, and I work for a salary, so at the end of the day, I probably cannot speak with authority on this.

    Myth: College Will Guarantee You Make a lot of Money.

    We do our kids a disservice equally by claiming that going through a college degree program somehow guarantees you wealth. There’s the economy, there’s the job market, there’s supply and demand. There are statistics that show college graduates may be more employable as young people, but for an individual kid there’s no relationship here. If you are smart and resourceful I think good career paths are out there either way. Likewise, a college graduate can be unemployable, or fixated on a specific career for which there are no jobs, or where all the jobs have low pay. This college=rich thing is nonsense.

    Myth: Money Makes People Happy. Therefore, College is Required for Happiness.

    Here’s the double-edged myth that rolls the previous three into a nice surprise package with a bow on top. There’s a message out there, repeated endlessly: College = Money = Happiness. The problem for kids is that, with one shot and a one-way ticket through life, if you discover this is a myth when you are 35 and miserable, a lot of time has been wasted. We need to remind kids that happiness is happiness. Once you get past the basic economic necessities so that you are comfortable, once you have activities and vocation that you find rewarding, it’s a statistically proven fact that happiness is not related to wealth. I think happiness might be related to education in some ways, which I’ll talk about in a minute, but that relationship does not detour through job training and money.

    Myth: College is / is not Worth the Money.

    With all this ranting, you might have the impression I am against sending kids to college – but nothing could be further from the truth. I am only ranting against all the undue emphasis on money. Here’s the thing: college really can enrich your life, but it’s not with dollars. You get one life, and all you have, really, on the last day, is your experience. The point is to live well, and for many people, going to college helps. It helps you grow up, it helps you think about things in new ways, it expands your view of the world. It helps you in your job, but the important ways that it helps are not as job training. It helps you write, for example, or meet deadlines, or communicate. It helps you with strategies for problem solving. Even then, it’s not necessary for everyone.

    After the first one or two jobs that a young person has, it is my fervent hope that employers don’t blindly make college degrees a requirement for job candidates, because all these soft job skills can also be acquired in other ways. It’s not that you must go to college to get them. College just helps some people get those skills faster. After a few years, in many professions, the college degree just should not carry that much weight. Look at the real qualifications of the candidate.

    College is expensive. Is it worth the money? If you went, and you are miserable, then it wasn’t. If you went, and you have a better life because of the experience then it was. The constant analysis of “this degree for this job has ROI where this one doesn’t” entirely misses the point. When I think about my kids’ future, the college question really comes down to this: in their one trip through life, will college help them to be happy, productive, satisfied? That’s all. That’s whether it’s worth the money.

    An English degree makes you literate, and being literate makes you happy? You have a decent job in some completely unrelated field, and that job is rewarding? That’s ROI.

    Myth: Everyone Needs a Degree.

    So, I don’t think a college degree makes everyone happy, or employable, or rich. So, while higher ed was right for me, and I am so happy to have it, I’m with Mike Rowe on this. Let’s be sensible about it, instead of forcing it on every kid, for imaginary economic reasons. We should instead, as he argues, make lots of different opportunities for lots of different kids in many fields.

    I’ve made a comfortable living, but not much money. We are not rich. I could have been smarter about what I have earned, in some ways. But my education is part of who I am, and I like knowing the things I learned. I’m happier for it.

    Certs

    That brings me around to the MCM program. I have thought a lot about going after MS certs and the MCM. I’d have a lot to learn, but I do feel like I could earn that. I personally have not gone after the lower certs, because, in the back and forth of whether they carry real weight or not over eight or so years, I always seem to land on the side where I think my experience is enough without the paper. I already have the motivation to hone my skills and learn this DBA craft on my own, without the deadlines or the tests. And with the web, and a modest book budget, the learning is mostly free.

    I have huge respect for those who have earned the MCM title, or who worked to create it, and I can imagine how disappointing it was when Microsoft made their blunder of an announcement. Counting some of the MCMs among my friends, and knowing how expert they are, I know this was a genuine, thorough, high bar for DBAs.

    That said, and not discounting their accomplishments, I think Brent Ozar had a really insightful take on this issue. It’s hazardous aligning something like lifetime educational goals, akin to the way I think about a college experience, with the very simple, hard realities of a technology corporation like Microsoft or Cisco. Those entities are going to do what they need to do economically, and teaching or certification, even while they may be well intentioned, are going to be bound to that corporate interest. I feel like it’s important to keep that in perspective. Insofar as the low-level technical detail of training and certs like this has such a short life – three years? – and all of it is bound, at the end of the day, to selling technology, then it really is the learning that matters.

    The bright side? Microsoft can remove the program, but they can’t take the learning or experience away from the people who participated, and, at least in my view, that’s the important part of the experience anyway.

  • IF NOT EXISTS ( SELECT ‘thisPost’ ) CREATE POST thisPost AS

    T-SQL deployment scripts are tedious to write and they get no love. As a result, they are very often the most horrible, error-prone stepchildren of scripts*. One basic principle that could help is to develop a habit of using the “IF NOT EXISTS” or “CREATE OR ALTER” patterns in these scripts. The idea is to first check the existing database before creating new objects. This is far better, in my opinion, than the common “DROP and CREATE” pattern, which can lose data and/or permissions if misapplied. Whenever possible, this type of deployment code should use the SQL Server catalog views, not the deprecated compatibility views. INFORMATION_SCHEMA has been demonstrated by others to be flaky.

    Make It So

    I like to adopt an approach where deployment scripts

    • Are re-runnable without harm. I’m calling this the “F5 F5 F5 F5 F5” principle. DROP AND CREATE is your enemy.
    • Should complete without errors when successful. (Errors should be real; none of this “ok, expect errors from the script and ignore them.” That’s just dangerous.)
    • Take whatever is there on the server, within reason, and change it to what the script says should be there. This is a declarative approach a bit like Picard’s orders to Number 1. Make It So. If an object is missing, then create it. If it’s there already, then move on. If it needs modification, then change it.

    The hard part about IF NOT EXISTS is that you need all these weird snippets of code to check system views for objects in the database. These are repetitive, but hard to figure out from scratch, and who has time for that? Well, I’m making time. I have had this post idea in the back of my head for a long time. This page will be a clearing house of code snippets to check for the existence of objects, and I’ll come back and add or adjust it in the future. You should be able to just snag what you need for that ugly deployment script and make a few edits.

    Use SQLCMD to Stop on Error and for Script Variables

    /* Optional but Recommended: use SQLCMD to set database name and stop on errors */
    
    :on error exit
    :setvar dbname myDatabase
    
    

    Notes: SQLCMD is a great addition to deployment scripts that allows one to avoid hard-coding things like database names, and to stop script execution when an error is encountered. You have to enable SQLCMD mode in SSMS to take advantage of this (Query > SQLCMD Mode).

    Every Script Needs Database Context

    Somewhere you must direct the deployment script to use the correct database. If you are doing that from a calling function like a PowerShell script, it could happen there. If not, your script needs:

    USE [$(dbname)]

    This should happen one time, near the top, after SQLCMD variable assignments and comments.

    Notes: without this, chances are people will accidentally fill master with junk. Truly.

    Create a Database if it Doesn’t Exist

    /* Create the database only if it's not already present */
    USE master
    
    IF NOT EXISTS ( SELECT name FROM sys.databases WHERE name = '$(dbname)' )
    BEGIN
        CREATE DATABASE [$(dbname)] ;
        ALTER DATABASE  [$(dbname)] MODIFY FILE
        ( NAME = N'$(dbname)', SIZE = 200MB, MAXSIZE = UNLIMITED, FILEGROWTH = 200MB ) ;
        ALTER DATABASE  [$(dbname)] MODIFY FILE
        ( NAME = N'$(dbname)_log', SIZE = 50MB, MAXSIZE = 2048GB, FILEGROWTH = 50MB ) ;
    END
    GO
    
    /* Set any important database-level options */
    ALTER DATABASE [$(dbname)] SET RECOVERY { FULL or SIMPLE } ;
    ALTER DATABASE [$(dbname)] SET AUTO_SHRINK OFF ;
    ALTER DATABASE [$(dbname)] SET AUTO_UPDATE_STATISTICS ON ;
    ALTER DATABASE [$(dbname)] SET READ_COMMITTED_SNAPSHOT { ON or OFF } ;
    ALTER DATABASE [$(dbname)] SET ALLOW_SNAPSHOT_ISOLATION { ON or OFF } ;
    /* ... etc ... */
    

    Notes:

    Edit here 7 Sept 2013 based on Aaron’s comments:

    This snippet works for small, simple databases. Edit the file size and growth settings to appropriate values, but please don’t use the truly awful default values from SQL Server. Avoid relying only on auto grow to size your databases; instead, when possible, allocate a plausible size and growth increment for the files. If you can’t predict the size, let the system run for a time, and monitor, until it reaches a steady state for file size. If you control the model database (if you’re creating in-house databases, for example) then you can adjust that. If you are deploying to unknown servers, then you can’t rely on reasonable defaults. That said, auto growing at a suitable increment is a reasonable fail-safe to prevent running out of room with a hard size limit.

    This method has the advantage of using the default locations for files at the server level instead of hard-coding file paths – especially the cringe-worthy
    N'C:\Program Files\Microsoft SQL Server\…

    If your database requires more files and file groups, you might need to provide a file path, but ideally use SQLCMD variables so that it can be easily repointed.
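
    For example, a sketch of that pattern, adding a secondary filegroup and file only if the filegroup is missing. The variable name, path, filegroup name and sizes here are all made up for illustration:

    :setvar datapath "D:\SQLData"
    
    IF NOT EXISTS ( SELECT * FROM [$(dbname)].sys.filegroups WHERE name = 'SecondaryFG' )
    BEGIN
        ALTER DATABASE [$(dbname)] ADD FILEGROUP SecondaryFG ;
    
        ALTER DATABASE [$(dbname)] ADD FILE (
            NAME = N'$(dbname)_data2',
            FILENAME = N'$(datapath)\$(dbname)_data2.ndf',
            SIZE = 200MB,
            FILEGROWTH = 200MB
        ) TO FILEGROUP SecondaryFG ;
    END
    GO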

    Create a Table if it Doesn’t Exist

    Preferred:

    IF NOT EXISTS ( 
        SELECT * FROM sys.tables t
        INNER JOIN sys.schemas s on t.schema_id = s.schema_id
        WHERE s.name = 'dbo' and t.name = 'myTable' 
    )
        CREATE TABLE dbo.myTable (
            col1 int not null,
            col2 nvarchar(50) not null
        );

    Also works:

    IF NOT EXISTS (
        SELECT * FROM sys.objects 
        WHERE object_id = OBJECT_ID(N'[dbo].[myTable]') AND type in (N'U')
    )
        CREATE TABLE dbo.myTable (
            col1 int not null,
            col2 nvarchar(50) not null
        );
    

    Notes:

    Don’t forget the schema. Use two-part names everywhere unless there is a real and compelling requirement not to.

    Management Studio can help generate these IF NOT EXISTS clauses, but the quality varies and some of the code behind that scripting is very old. Still, it can give you a working starting point. In SSMS, go to Tools > Options > SQL Server Object Explorer > Scripting and look for the setting “Check for Object Existence” or “Include IF NOT EXISTS clause,” then script out some objects with the GUI and examine what you get back. Except for procedures, these are sort of OK.
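
    And since “don’t forget the schema” keeps coming up: if the script targets a schema other than dbo, the schema itself can get the same treatment. CREATE SCHEMA must be the only statement in its batch, so it goes through sp_executesql. The schema name “app” below is just an illustration:

    IF NOT EXISTS (
        SELECT * FROM sys.schemas 
        WHERE name = 'app'
    )
        EXEC sp_executesql @statement = N'CREATE SCHEMA app AUTHORIZATION dbo' ;
    GO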

    Create a View if it Doesn’t Exist

    Preferred:

    IF NOT EXISTS (
        SELECT * FROM sys.views v
        INNER JOIN sys.schemas s on v.schema_id = s.schema_id
        WHERE s.name = 'dbo' and v.name = 'myView'
    )
        EXEC sp_executesql @statement = N'CREATE VIEW dbo.myView AS SELECT ''Placeholder'' AS Placeholder' ;
    GO
    ALTER VIEW dbo.myView AS
        SELECT col1 
        FROM dbo.myTable ;
    GO

    Also works:

    IF NOT EXISTS (
        SELECT * FROM sys.views 
        WHERE object_id = OBJECT_ID(N'[dbo].[myView]')
    )
        EXEC sp_executesql @statement = N'CREATE VIEW dbo.myView AS SELECT ''Placeholder'' AS Placeholder' ;
    GO
    ALTER VIEW dbo.myView AS
        SELECT col1 
        FROM dbo.myTable ;
    GO

    Notes:

    Don’t forget the schema.

    This looks odd at first, but is wonderful to use. It mimics the “CREATE OR ALTER” feature in some other dialects of SQL.

    Create a Procedure if it Doesn’t Exist

    Preferred:

    IF NOT EXISTS (
        SELECT * FROM sys.procedures p
        INNER JOIN sys.schemas s on p.schema_id = s.schema_id
        WHERE s.name = 'dbo' and p.name = 'myProc'
    )
        EXEC sp_executesql @statement = N'CREATE PROCEDURE dbo.myProc AS SELECT ''Placeholder'' AS Placeholder' ;
    GO
    ALTER PROCEDURE dbo.myProc AS
        SELECT col1 
        FROM dbo.myTable ;
    GO

    Also works:

    IF NOT EXISTS (
        SELECT * FROM sys.objects 
        WHERE object_id = OBJECT_ID(N'[dbo].[myProc]') AND type in (N'P', N'PC')
    )
        EXEC sp_executesql @statement = N'CREATE PROCEDURE dbo.myProc AS SELECT ''Placeholder'' AS Placeholder' ;
    GO
    ALTER PROCEDURE dbo.myProc AS
        SELECT col1 
        FROM dbo.myTable ;
    GO

    Notes:

    Don’t forget the schema.

    Like views, this rather strange one imitates “CREATE OR ALTER.” That allows you to keep one source file, in source control, for example, and update it without changing from CREATE to ALTER or even worrying about it.

    Create a Function if it Doesn’t Exist

    IF NOT EXISTS (
        SELECT * FROM sys.objects 
        WHERE object_id = OBJECT_ID(N'[dbo].[getFunky]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT')
    )
        EXEC sp_executesql 
        @statement = N'CREATE FUNCTION dbo.getFunky () RETURNS TABLE AS RETURN SELECT ''Placeholder'' AS Placeholder' ;
    GO
    ALTER FUNCTION dbo.getFunky ( @someParameter int )
    RETURNS TABLE
    AS
        RETURN
            SELECT col1 
            FROM dbo.myTable
            WHERE col1 = @someParameter ;
    GO
    

    If a Primary Key is Missing, Add it

    IF NOT EXISTS (
        SELECT * FROM sys.tables t
        INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
        INNER JOIN sys.indexes i on i.object_id = t.object_id
        WHERE i.is_primary_key = 1 
        AND s.name = 'dbo' AND t.name = 'myTable'
    ) 
        ALTER TABLE dbo.myTable
        ADD CONSTRAINT PK_myTable PRIMARY KEY /* { maybe CLUSTERED } */ (
            col1
        ) WITH ( ONLINE = ON /* or OFF */ )
        /* perhaps ON [myFileGroup] */ ;
    

    Notes:

    This snippet looks for any primary key at all. That’s because you often cannot trust the name of the primary key, if it was auto-generated by the server at some point in the past. It’s good practice to name constraints, but not everyone does. I actually have some code that says “if the primary key is present but not named {x} then sp_rename it to {x}.” But that’s a little DBA-nuts.
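
    For what it’s worth, that rename logic is roughly the following sketch. It assumes the dbo.myTable example above and PK_myTable as the desired constraint name:

    DECLARE @currentName sysname, @oldFullName nvarchar(300) ;
    
    SELECT @currentName = i.name
    FROM sys.tables t
        INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
        INNER JOIN sys.indexes i ON i.object_id = t.object_id
    WHERE i.is_primary_key = 1
        AND s.name = 'dbo' AND t.name = 'myTable' ;
    
    IF @currentName IS NOT NULL AND @currentName <> 'PK_myTable'
    BEGIN
        SET @oldFullName = N'dbo.' + @currentName ;
        EXEC sp_rename @objname = @oldFullName, @newname = 'PK_myTable', @objtype = 'OBJECT' ;
    END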

    If a NonClustered Index is Missing (by Name), Add it

    Preferred:

    IF NOT EXISTS (
        SELECT * FROM sys.tables t
        INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
        INNER JOIN sys.indexes i on i.object_id = t.object_id
        WHERE s.name = 'dbo' AND t.name = 'myTable' and i.name = 'myNCI'
    ) 
        CREATE NONCLUSTERED INDEX myNCI 
        on dbo.myTable ( 
            col2 
        ) WITH ( ONLINE = ON /* or OFF */ ) ;

    Also works:

    IF NOT EXISTS (
        SELECT * FROM sys.indexes 
        WHERE object_id = OBJECT_ID(N'[dbo].[myTable]') AND name = N'myNCI'
    )
        CREATE NONCLUSTERED INDEX myNCI 
        on dbo.myTable ( 
            col2 
        ) WITH ( ONLINE = ON /* or OFF */ ) ;
    GO
    

    Notes: this assumes you trust the name of the target index, and that if an index with that name is present, its structure is correct. It doesn’t check the columns in the index.
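
    If you do want to go a step further and sanity-check the key, a sketch against sys.index_columns could look like this. It only verifies that col2 is the leading key of myNCI, nothing more:

    IF EXISTS (
        SELECT * FROM sys.indexes i
            INNER JOIN sys.index_columns ic 
                ON i.object_id = ic.object_id AND i.index_id = ic.index_id
            INNER JOIN sys.columns c 
                ON ic.object_id = c.object_id AND ic.column_id = c.column_id
        WHERE i.object_id = OBJECT_ID(N'[dbo].[myTable]') 
            AND i.name = N'myNCI'
            AND ic.key_ordinal = 1 
            AND c.name = N'col2'
    )
        PRINT 'myNCI leads with col2, as expected' ;
    ELSE
        PRINT 'myNCI is missing or has a different key structure' ;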

    Conclusion

    I do wish SSMS had better and more sophisticated support for scripting these clauses, but, lacking that, I resort to a small library of these and look them up. I’ll add to this page as I locate more of these – but contributions are certainly welcome!

    * Some people and organizations use various tools to help automate this, but my real-world experience tells me that old-fashioned scripts are still a huge proportion of database installation and patch procedures.

  • SQL Server File Layout Viewer 1.1 Release

    Edit 9 October 2013: A new version is out! Please download 1.2 from here

    Back in the Spring I released a 1.0 version of a utility that will show the structure of SQL Server files graphically. Today I have a new release that includes some fixes and updates based on user feedback, bug reports, and suggestions.

    In this release:

    1. The database scan is significantly faster because it now uses parallelism.
    2. Larger databases are supported, and files over 4GB are now read correctly – but don’t go too crazy. Really large data sets are not likely to work.
    3. SQL Server Auth is now available.
    4. Various bug fixes around pages that contained data at one time but are not currently allocated, and pages “captured” from other databases via instant file initialization.

    Fundamentals about the app are the same:

    1. The app will read a database and then present a color-coded visualization of all the data by object and by page type.
    2. Mouse over the visualization to get details about the pages in a panel on the right edge of the app window.
    3. Use this in test, not production. It’ll scan the whole database, which would probably impact performance.
    4. This is designed to study and learn about storage internals using modest data sets. Running it on a large warehouse, for example, is probably not going to work.

     

    SFLV_1_1Screenshot

     

    I want to thank those in the SQL community who helped with this project via Twitter, testing for me, bug reports and critiques. You know who you are, and y’all rock :-).

  • STOP! Wrong Server!

    Some time ago I wrote a post about old-time T-SQL flow control. Part of the idea was to illustrate how to make scripts abort on errors, which is surprisingly convoluted in T-SQL. Today I have a more specific example: how can I check that we are really about to execute in the right database on the right server, and halt if not. Perhaps you’ve connected to the wrong server, or the database you intend to modify doesn’t have the name your script expected. “USE database” is wonderful, but what if it didn’t work?

    As in the old article, there are two problems: 1. How do I stop inside the batch that is currently running (i.e. between “GO” batch terminators)? 2. How do I stop the client from merrily sending the rest of the batches in a whole script to the SQL Server?

    There are a number of ways to do this, but there is one method I have found to be clear:

    1. Run in SQLCMD mode, in order to take advantage of :ON ERROR EXIT. This stops the client from transmitting more SQL batches after a batch has thrown an error.
    2. Wrap a check for the server name and database name in a TRY/CATCH at the top of the script. For most practical purposes this can be the first batch, only, since we are planning to halt if the test fails. For versions 2005 to 2008 R2, this has to use RAISERROR, but for 2012 (only) we can finally take advantage of THROW.
    :ON ERROR EXIT
    
    BEGIN TRY
        IF ( SELECT @@SERVERNAME ) != 'TheRightServer' RAISERROR( 'WHOA! WRONG SERVER!', 16, 1 );
        USE [master]
        IF ( SELECT DB_NAME() ) != 'TheRightDB' RAISERROR( 'WHOA! WRONG DATABASE!', 16, 1 );
        PRINT 'Rest of First Batch Ran';
    END TRY
    BEGIN CATCH
        IF ERROR_MESSAGE() = 'WHOA! WRONG SERVER!' RAISERROR( 'WHOA! Caught: WRONG SERVER!', 16, 1 );
        ELSE IF ERROR_MESSAGE() = 'WHOA! WRONG DATABASE!' RAISERROR( 'WHOA! Caught: WRONG DATABASE!', 16, 1 );
        ELSE RAISERROR( 'An exception was thrown in the first batch', 16, 1 );
    END CATCH
    GO
    
    PRINT 'Second Batch Ran';
    GO
    

    Breaking this down – if I am in the wrong server or the wrong database, what I want to have happen is that the statement

    PRINT ‘Rest of First Batch Ran’ ;

    should not execute. The way to accomplish that is to raise an error if we are in the wrong server or database, and use the TRY/CATCH to transfer execution down to the CATCH block. A sysadmin could raise a more severe error that would stop batch execution, but that’s less friendly, and we can’t assume we are all sysadmins. To me this method needs to work with error severity 16, for a typical user.

    Further, it’s important that remaining batches in the whole script are not executed. (Remember, batches are separated by “GO” and are each like a mini program sent from the client to the server.) That means that

    PRINT ‘Second Batch Ran’;

    should not execute if the first batch threw an error.

    The only practical way to do that is with the

    :ON ERROR EXIT

    directive, but that only works in SQLCMD mode.

    For SQL Server 2012 (and, going forward, higher versions), this example can be simplified using THROW:

    :ON ERROR EXIT
    
    BEGIN TRY
        IF ( SELECT @@SERVERNAME ) != 'TheRightServer' THROW 50000, 'WHOA! WRONG SERVER', 1 ;
        USE [master]
        IF ( SELECT DB_NAME() ) != 'TheRightDB' THROW 50000, 'WHOA! WRONG DATABASE!', 1 ;
        PRINT 'Rest of First Batch Ran';
    END TRY
    BEGIN CATCH
        THROW;
    END CATCH
    GO
    
    PRINT 'Second Batch Ran';
    GO
    

    Safe computing!

  • Quick and Dirty PowerShell SQL Server Load Test

    Today I wanted to share a very short PowerShell script that can be used to simulate running multiple clients against a SQL Server, for simple performance or hardware testing. One thing I love about PowerShell is that so many functions and features are pre-built and easily accessible that I have to write very little code to accomplish a lot. This script uses the background job feature to run asynchronous processes, and the SQL features to invoke scripts stored in files, and I didn’t have to engineer any of that, which is cool. It is out-of-the-box PowerShell 2 and SQL Server 2012 at the client, but would work against basically any version of SQL Server at the server side. Not bleeding edge at all.

    # Start  clients running the same SQL script simultaneously
    
    $numWorkers = 10
    
    $startTime = ( Get-Date ).AddMinutes(1)
    
    $jobscript = {
        param( $startTime )
             
        while ( ( Get-Date ) -lt $startTime ) { Start-Sleep -Milliseconds 100 }
        
        Invoke-Sqlcmd -InputFile 'path to your\sqlscript.sql' -ServerInstance yourSQLServer -Database yourDB -QueryTimeout 0  
    }
    
    (1..$numWorkers) | foreach {
        Start-Job $jobscript -ArgumentList $startTime -InitializationScript { Import-Module "sqlps" -DisableNameChecking }
    }

    The script does just one thing: given a SQL script stored in a file, it will start some number of simultaneous connections running the SQL against a server. Let’s break it down a bit to see how.

    First we establish the number of client connections to use, and identify a single start time for them to begin running, one minute from the current time. This extra minute will give all the threads time to get created/configured so that later we can cause them to start at approximately the same moment:

    $numWorkers = 10
    
    $startTime = ( Get-Date ).AddMinutes(1)

    Next, we need a “mini PowerShell script” inside this script to pass to an async job to run. This “mini script” is actually what the multiple worker threads execute. There are a few ways to make this, but a simple way to handle it is just to assign a code block { } to a $variable:

    $jobscript = { work for the async job goes here }

    The background job feature has some behavior that is not completely intuitive – for one thing, variables in your calling script won’t be accessible from a running job, because the job runs in a completely separate session/environment. For that reason, you have to explicitly pass in any values that you need in the code that the job runs. This is the reason for the first line inside the job script:

    param( $startTime )

    Using this parameter, we can pass the single start time value from the outer script to all the background jobs. Without it the jobs could not “see” the value.

    Next, the body of the job script is simply “wait for the start time, then execute this SQL script”:

        while ( ( Get-Date ) -lt $startTime ) { Start-Sleep -Milliseconds 100 }
        
        Invoke-Sqlcmd -InputFile 'path to your\sqlscript.sql' -ServerInstance yourSQLServer -Database yourDB -QueryTimeout 0  

    Note: Invoke-Sqlcmd is not loved by everyone, and it has some limitations, but it’s there and mostly works. If you are on pre-2012 SQL client tools, beware the bug for -QueryTimeout 0 and long-running queries.

    That is all the code we need to run within the background jobs.

    Lastly, we need to spin those background jobs up. This funny thing generates a list of numbers, 1 to 10:

    (1..$numWorkers)

    That list is piped to a foreach loop, which just drives the loop to run 10 times. For each iteration of the loop, we create a background job set to run the code in $jobscript:

        Start-Job $jobscript -ArgumentList $startTime -InitializationScript { Import-Module "sqlps" -DisableNameChecking }

    This statement again shows some of the special behavior of background jobs – first, you have to explicitly pass in any values using the -ArgumentList parameter, because the scope where the job runs cannot “see” your other code or variables. This is how to get the startTime value into all the jobs.

    Second, the background job won’t have your PowerShell profile. In my profile, I load the sqlps module, which is where Invoke-Sqlcmd lives. Since my profile doesn’t load for the job, the job would not have that module available, and would not know what “Invoke-Sqlcmd” is. Fortunately, the Start-Job cmdlet has an -InitializationScript option that allows you to pass a second code block that runs before the real work in the job. This is a good place to set up or configure the job, in this case loading the required module.

    The last item you’ll need for this is the actual SQL script that has your testing logic, like running resource-intensive queries in a loop. I will leave that as an exercise for the reader.

    Once the jobs are running you can obviously watch PerfMon or use other diagnostics to see how the subject server is handling the load. Back in PowerShell, you can use the cmdlets Get-Job, Receive-Job, Stop-Job and Remove-Job to manage or view the jobs created above.
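
    For example, cleanup after a test run can be as simple as this (nothing here is specific to my script; these are just the standard background job cmdlets):

    # Check the state of the background jobs
    Get-Job

    # Wait for all the jobs to finish, then collect their output and any errors
    Get-Job | Wait-Job | Receive-Job

    # Remove the jobs when finished; -Force also stops any that are still running
    Get-Job | Remove-Job -Force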

    I used this to test a new server yesterday with a PCI flash storage card and drove it to 163,000 real, SQL Server, random read IOs/sec, which was pretty cool. Happy load testing!

  • File Layout Viewer vs. Drop Clustered Index

    I had a very educational exchange the other day on Twitter (also known as “getting schooled”) by my SQL compatriots Argenis Fernandez (@DBArgenis) and Hugo Kornelis (@Hugo_Kornelis). A guy called Chris Switalski (@crswit) asked a really interesting question on #sqlhelp, which I proceeded to answer incorrectly, which led to a correction by my friends, which in turn took me back into the SQL File Layout Viewer, which quickly uncovered a bug there.

    So, learn from your friends on Twitter!

    Anyway, here’s the question, and corrected demo using an updated SQL File Layout Viewer build.

    Question (paraphrasing):

    When you create a clustered index on a heap, the data from the heap is rewritten and ordered, and the upper levels of the index are added. If you drop that clustered index, what happens? Does the set of leaf-level pages stay in place, and the index pages just get unlinked and deallocated? Or is the data rewritten again? Does it stay in the same logical or physical order, even though the ordering requirement is removed?

    I had said that I imagined the data would be rewritten in both cases, but that we could look and see to be sure. I was totally wrong – the leaf level of a clustered index is made of data pages, and a heap is made of data pages too, as Argenis pointed out. Since they are not different structures down at that level, in order to drop a clustered index the server will indeed just remove the upper levels of the index and leave all the leaf level data pages in place. Cool optimization! It also turns out that my original release of the File Layout Viewer doesn’t always display this situation correctly, so I had to make some changes. This demo, though, I believe shows the behavior correctly.

    Answer:

    First, we make a test database and create a heap, and populate it with some dummy data:

    USE master
    IF EXISTS ( SELECT 1 FROM sys.databases WHERE name = 'HeapTest' ) DROP DATABASE HeapTest;
    CREATE DATABASE HeapTest;
    GO
    
    ALTER DATABASE HeapTest SET RECOVERY SIMPLE;
    GO
    ALTER DATABASE HeapTest MODIFY FILE ( NAME = N'HeapTest', SIZE = 102400KB );
    GO
    ALTER DATABASE HeapTest MODIFY FILE ( NAME = N'HeapTest_log', SIZE = 10240KB );
    GO
    
    USE HeapTest;
    
    CREATE TABLE SampleData ( 
        Data uniqueidentifier DEFAULT NEWID(), 
        Padding char(100) DEFAULT ( CAST( CHECKSUM( NEWID() ) as CHAR(100) ) ) 
    );
    GO
    
    INSERT SampleData DEFAULT VALUES;
    GO 100000

    If we run the File Layout Viewer, we see a nice, contiguous set of data pages (type 1) in teal:

    FLVDropClust01

    Next, we’ll add a NON clustered index, just so we can track what happens to that structure too:

    CREATE NONCLUSTERED INDEX Nonclust ON SampleData ( Padding );
    

    That shows up in the next part of the data file, in bright blue:

    FLVDropClust02

    Now if we make a clustered index on the heap, it will have to be rewritten into another region of the file, so that it can be stored in the right logical order. We should expect it to “move” in the visualization down into the empty part of the file. The non clustered index will also be rewritten, because it will switch from using physical row locations to point to data in the heap to using the clustering key to point to data in the new clustered index. Let’s see what it looks like:

    CREATE CLUSTERED INDEX Clust ON SampleData ( Data );
    

    The new clustered index and new version of the NON clustered index are both created in another region of the data file:

    FLVDropClust03

    You can see, in lighter gray, the regions that were emptied where the two structures had been stored, and the two new structures now in the center of the graphic.

    Now for the part of the question I got wrong: what happens when we drop that clustered index?

    DROP INDEX SampleData.Clust;
    

    This is really neat: the leaf level of the clustered index (remember data pages are data pages) stays right where it is, and the upper level index pages are just deallocated/unlinked.

    The NON clustered index does have to be rebuilt yet again to change those clustered index keys out for row location pointers that point to the rows in the heap, so those pages do get rewritten into yet another region of the file:

    FLVDropClust04

    So, to summarize:

    When you add or remove a clustered index from a table, any NON clustered indexes always have to be rebuilt - and will be automatically - to exchange row pointers for clustering keys in the non clustered index itself. This generally takes care of itself, but it can be a performance issue, or cause longer running transactions or a large amount of logging.

    When you CREATE a clustered index on a heap, the data has to be rewritten to enforce the logical order of the index.

    When you DROP a clustered index, however, the server can keep all the leaf level pages in place and just strip away the index levels above.
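
    If you want to corroborate that last point without the File Layout Viewer, one rough way – on a test system only, since it relies on the undocumented sys.system_internals_allocation_units view – is to note the first data page of the table before the DROP INDEX and confirm it hasn’t moved afterward. Run this in the HeapTest database from the demo:

    /* Sketch, for a test system only: note first_page before the DROP INDEX,
       run again afterward, and compare. sys.system_internals_allocation_units
       is undocumented and may change between versions. */
    SELECT au.first_page, au.total_pages
    FROM sys.system_internals_allocation_units au
        INNER JOIN sys.partitions p ON au.container_id = p.hobt_id
    WHERE p.object_id = OBJECT_ID(N'dbo.SampleData')
    AND p.index_id IN ( 0, 1 )            -- the heap or the clustered index
    AND au.type_desc = 'IN_ROW_DATA' ;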

  • Telecommuting, Month 9

    Five Things I’ve Learned About Surviving the Telecommute

    Over the past several years my wife and I went through this strange series of events where we swapped work roles – first she worked remotely for a Minnesota company, from our home in Seattle, and then we moved to Saint Paul, Minnesota and I started working remotely for my company back in Seattle. Our story is a lot like what you read about remote work, but it’s different to live it.

    Backstory

    We lived on Vashon Island in Puget Sound, near Seattle, for about eight years. The small company my wife works for relocated from there to Minnesota, but she kept her job, working remotely as part of what became a virtual corporation. They were a very small company, but tech-savvy enough that, while not an IT business, remote work seemed like a natural arrangement. I think that shows just how far that idea has come. Like many U.S. families, we video conference with the Grandmas and Grandpas spread around the country, which she effortlessly applied to work, too. Her little company used “cloud” services without really thinking about that idea as anything new or special. It was just a natural fit. They valued her, she valued them, the tech was pretty easy. It was all win.

    Fast forward a few years and our family was looking to relocate, too. Her company was not really the destination – we needed mainly to be closer to our extended families. “Closer” today often means flight time and number of connections (a.k.a. airport) more than physical proximity. Our families are spread all over the eastern half of the U.S. so moving to one or the other city was something we considered, but wasn’t really the only solution. We have kids, but we are also city people and we both really wanted the culture and amenities of a real city and a good, safe urban environment for our boys. Not too many cities have that combination.

    “How the heck did you pick Saint Paul, MN?” is a question I get all the time. It’s hard to explain, but take the airport, proximity to East Coast, South and Northwest by one direct flight, the truly outstanding elementary school in our new neighborhood, the old Victorian gridiron neighborhoods of Summit Hill/Macalester/Groveland, a good friend from college, no fear of snow, and toss them all together in a geography salad, and that was our destination.

    By freak accident, or fate, it’s also not far from my wife’s workplace.

    Switcheroo

    We moved and we both kept our jobs, which was amazing. My wife is able to go into the office a few days a week now, though it’s a long drive, and I started telecommuting to my job in Seattle from Saint Paul.

    Both of us have experienced the usual pros and cons of telecommuting. By now some of these things are well documented, but I think it’s hard to see how important those issues are until you experience them firsthand, either as a remote employee or working with a remote employee, and over a long period. It’s great, and it’s difficult at the same time.

    I will say my employer has been really wonderful throughout this process – they are supportive, and they had to go through extra effort to make this happen for me, which they certainly didn’t have to do. We had some tentative few-days-a-week sort of telecommuting for a while, but I think I am the first true full time, long distance, different-time-zone sort of remote worker, with all the challenges that come along in that scenario. Because there wasn’t really an established environment used to remote work, with habits and technology and processes in place, we’ve had to work some things out as we go.

    In fact every aspect of it, for me, has proven a double-edged sword, and I wanted to spend some time talking about why.

    Coffee Shop? No.

    First I have to get one myth out of the way: working in a coffee shop is OK -- if you want to work in an uncomfortable, noisy place where you can’t use the phone, the network is dodgy, your screen’s too small, coffee costs $4, and there’s no privacy. It only sounds romantic. Honestly.

    I work from a dedicated space in my house in order to be focused and minimize distractions, and that has worked well. I’d say to even contemplate working from home you need a separate space, with a door you can close. The attic space where I currently work is great in terms of creating a separate work environment in the house. I have a real desk and an IP phone connected straight to the office. The space has some other issues, notably not being a layout where office furnishings, network and power work well, but we are moving to a permanent home a few blocks away, and in purchasing that house the “office space” was one of the main purchase criteria. We should be into the new place in late June.

    On the other hand, working alone in a home office has some real social challenges, as I’ll talk about more below. It’s quiet. Really, really quiet, which can be too much of a good thing.

    Losing the Commute

    I have always – so far, anyway – avoided a typical driving commute. Spending hours on the freeway in a car is so awful, for me, I have always orchestrated living and working so as to avoid it, by living close to my office or to transit. Eliminating the commute altogether is obviously a fantastic perk of telecommuting. I am eternally grateful for the hours I’ve gotten back.

    And yet.

    Here’s the strange thing, which others have talked about too: if you are the type of person for whom remote work will be successful, you probably are the type who has a passion for work, where work isn’t just time spent from eight to five. Passion for work is something that creates the discipline to work consistently without explicit direction, and to produce all on your own, regardless of schedule.

    Those same traits also make the removal of the commute problematic. Life and work can start to merge together, and it isn’t always pretty. Leaving the house, getting in the car/bus/train/ferry and being forced to sit there, even for twenty minutes, is an important psychological boundary between work and life, where you – even subconsciously – leave one and get mentally prepared for the other.

    This is something I really struggle with. I have a tendency to obsess about work anyway, and when it’s always right here, that can be a real problem. It’s particularly bad during the times when work is stressful. Working from home demands that your team evaluate you on your output as much as just your presence, but your perceived presence looks irregular, and that can cut both ways.

    Isolation

    There are huge benefits to being out of the daily fray in the office, mostly in terms of productivity on heads-down work. Interruptions are minimal. I used to use the “Pomodoro” technique in the office, working in 20-25 minute sprints, but it was mostly as a survival technique against the onslaught of meetings, and email, and walk-up requests. I find in my new situation that I don’t have to do that, because finding solid blocks of time to do real work isn’t so much of a problem.

    The other refreshing thing is that one important type of work – thought – is OK. I used to be in a cube, in an open office. One challenge was that some parts of my job require me just to think. Think about strategy. Think about how to design against some complex problem. Read books. When you think, you don’t necessarily look busy. It’s productive time, but you aren’t producing a thing someone can point at. So, time to think through problems without some pretense of tapping at the keyboard is really valuable. In the office I used to block time out and walk on a nearby bike path in order to have this time without needing to look “busy.” Now I can do this in the house, which is great.

    Communication with my team really is a challenge. I have three-times-a-week calls with my team lead, and daily short calls with my team just to check in. It sounds odd, but these are a real requirement just to help offset the lack of hallway conversation that normally takes place in the office. If you are thinking of working remotely, I would advise making the extra effort to schedule time like this, because even though it sounds trivial, it’s really important to keep everyone on the same page.

    The biggest challenge of all is missing out on the rest of the hallway or cube/desk decision-making that can go on. I always had my ear to the ground in the office, and viewed part of my DBA role as trying to steer everyone to the right decisions about data and SQL Server, even when it meant heading something off that I overheard in the next cube. For better or worse, that part of my job isn’t possible remotely, and that can be a challenge.

    But I’ll also lay this right out there: it’s lonely. At different times in life we have different social structures to rely on for friendships. There’s high school, college, church -- before kids my wife and I used to have dinner with friends or have the occasional party. If you’re 40+, with kids, work is a main source of social interaction and friends. There aren’t that many other social structures like it. Moving to a new city and essentially breaking that social interaction with my coworkers is tough, and there isn’t much to replace it.

    Social media (and I’m hooked on it) doesn’t fill that void, I’m afraid. It naturally took over more of my free time with this move, but as months go by I realize there’s something depressing about it, and I need to limit the time I spend online.

    Children

    Some things about remote work have helped our family life, but mostly it relates to basics like flexible hours and the time recovered from not spending an hour a day on the road. If you think that working remotely will help with child care expenses, I’d say stop – that’s unlikely. It’s not possible, for me anyway, to actually care for the kids in any reasonable way and work at the same time. There’s no way. I think I could do a poor job with work and a poor job watching them, and that serves nobody well. We have just as many childcare arrangements and expenses as if I drove to work.

    But I do love the little perks. Many mornings I get to eat breakfast with the kids and put the oldest one, who is in second grade, on the bus. My wife and I have lunch dates, usually once a week, on a day we both work from home and the kids are at school.

    On the other hand, there are times when I have to explicitly stop my wife or my kids and just say, “I really can’t do x because I am working. Truly.” And because I’m in the house, it is easier for them to stop me and ask, “Honey, can you stop and …” The fact that my wife has work-from-home experience too really helps, but this can be a problem (in both directions; I am guilty too).

    For better or worse, when we really have an issue with the family, which sometimes comes down to just not having enough hours in the day, my wife or I can work in the evening to make up time or meet deadlines, and that is definitely a major benefit. It can, though, aggravate the whole problem of not leaving work at work and being free of it at home.

    Tech, Tools and Email

    This is the part all us gadget freaks gravitate to, I think – what tools make work-from-home possible, or make it better. Here the news is also mixed.

    First, I. Love. Email. To a fault.

    Imagine a medium where multiple people can collaborate on a solution and everyone is informed, where the outcome and the trail of decisions is documented in writing automatically, and the participants can all work on it asynchronously, without being interrupted. In a perfect world, it seems like email would almost do that. (I dislike the phone for all the inverse reasons.)

    And yet.

    Working remotely makes it so easy to fall into an email-only mode of communication, and while I love that, it flatly doesn’t work well for some types of activity. For one, I find it almost impossible to persuade some individuals to do something through email. To act. I am very sorry to report that that still often requires a phone call or, ack!, a full-on meeting, with a manager, because email simply goes ignored. Among those that like email, email works, but with others, sometimes not. Fortunately my team lead understands this well, and I know I can ask him to set up that meeting or that call when we have to go that route. Email can also foster miscommunication in some circumstances, especially relating to tone and personal interaction. This email thing is clearly a habit I have to be careful about.

    What about all those other tools? Tele-presence and all the funnily-named collaboration do-dads like Skype, Trello, Dropbox, Toggl and Producteev and Flow and Remember the Milk and Google Hangouts and Asana?

    There is, unfortunately, a sobering reality here. The number and variety of online collaboration tools really have exploded recently with the growth of remote workforces and virtual organizations. But it’s been my experience that a company probably has to be young (in attitude, not necessarily age), nimble, small, engaged in ideas about the modern workforce, willing to try new things, and dealing in few secrets before it will contemplate using most of these services. Otherwise the obstacles are many: change, security, habit, ignorance, organizational inertia, isolation, policies, firewalls both real and imagined.

    If you are in a company that is bigger, a company with security concerns, a company that cannot change course quickly, then it seems likely you might get the right to work from home but not get access to any of these cool new tools.

    So, I have to say, in my case I have Lync (barely) for desk or app-sharing, and an IT group that is just now getting the hang of that, but no video, no tele-presence, no novel SaaS online project management whiz-bang anything, funny names or no. And I suspect that the work-from-home crowd may fall into two groups – those with the new tools in small companies or start-ups, and those in established companies like me trying to make it work with just RDP and the phone.

    I don’t think this is related to my company’s commitment to my success working remotely, or to their support of it – it’s simply too hard to overcome the organizational inertia, which will be true in many organizations.

    But with some good processes and habits, I think the minimal-tools predicament can still work. Process and habit, I believe, are probably 85% of the battle, and while coolio tools might aid that, I don’t imagine they are a substitute for it.

    I do get jealous of my wife on this front, though, because a small company on a shoestring budget can do more with cloud-based applications like this than a bigger, well-capitalized one. Ironic.

    Conclusions?

    I think of this as an ongoing experiment. I love my job as a DBA – though, to be sure, every job comes with some challenges – and in many ways it’s suited to remote work. I am finding it harder than I imagined, mostly because it’s such a solitary existence, and because the lines between work and home are so blurred as to be almost erased. I have to keep changing techniques and trying different things to make it work well.

    It’s become clear that I should not wait for the next “teh aw3some” tool to come around, and instead focus on basics like process, communication, and habit to make the fundamental tools of RDP, phone and airplane work to their fullest.

     

    P.S. I am grateful to SQL community friends Tom LaRock, Mike Walsh, and Aaron Bertrand for part of this remote work adventure.

    Edit 23 April: A friend of mine reminded me how important it is to go back into the office and talk with people face to face. I thought I should add, since I didn't make this clear, that I do go back for three days or so every other month. That's vital to this whole arrangement. I try to pack those visits with meetings just so I can talk with people and get reacquainted. 

     

  • Speaking in Chicago Saturday

    I’m excited to be offering a beginning PowerShell session at SQL Saturday #211 in Chicago on Saturday, April 13. This time we’re making it a family weekend, bringing our two boys. I haven’t been to Chicago for many years, and it’s, of course, an Architecture dream world, so it should be fun to look at some buildings again!

    Emil_Bach_House

    Emil Bach House, photo © Jeremy Atherton, 2006

    The session is intended to help you get started with some PowerShell syntax basics. I try to demystify the Pipeline a bit, help make sense out of all the { curlies } and [ brackets ] and ( parentheses ), and translate some of what you already know about T-SQL into PowerShell. If PowerShell seems cryptic or mysterious, come on down and maybe we can sort it out together.

    Should be a great time.

  • SSMS 2012 Restore GUI Gotcha

    Today I want to bring to your attention an issue in the SQL Server Management Studio 2012 restore GUI. In many ways the new restore dialog is nicer than the old one, with new features and added convenience – but, as is always the Achilles heel of GUI tools like this, if you don’t know what’s really going on it can bite you. I’m not sure what to call this issue, maybe just a UI design flaw. Technically it works as designed, and there’s nothing really wrong with it, so it’s not a bug. But I can imagine it really causing someone pain who is careless or doesn’t know what’s happening behind the scenes.

    Restoring a copy of a database may take the original down.

    Among the new features is some management of “Tail-Log Backups,” which is a wonderful idea. Taking a backup of the tail of the log will take down the database and put the whole end of the log into a backup file, preserving basically all modifications to the DB from the log. This is ideal, for example, at the last step of migrating a database using log shipping, because all the activity is preserved and the source database is locked against further modifications. The log chain is preserved in moving to the new copy of the database, or in a DR scenario, to your other server.
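
    In T-SQL terms, a tail-log backup is just a log backup taken WITH NORECOVERY, which is what leaves the source database in the RESTORING state. Roughly – the database name and path here are placeholders:

    -- Roughly what a tail-log backup looks like. This leaves MyProdDB in the
    -- RESTORING state, and unavailable, until it is restored WITH RECOVERY.
    BACKUP LOG MyProdDB
        TO DISK = N'X:\Backup\MyProdDB_tail.trn'
        WITH NORECOVERY ;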

    The problem is, I know a lot of people don’t know this. Often it’s people who use the GUI a lot, such as “accidental DBAs.” I think the GUI is really important, despite those who rarely use it or frown on it.

    Here’s the issue: Suppose I have a production database that has some user-created problem (like a user accidentally updated some rows.) It may be reasonable to restore a copy of the database to a point before the issue and investigate whether it’s possible to merge the data back in. The original database isn’t damaged from a technical point of view – there’s no corruption, for example.

    1. Right-click the original source database and choose Tasks > Restore > Database … from the context menu.
    2. The resulting dialog will go out and discover the backup files for that database from MSDB, which is very helpful.
    3. Change the name of the target database, in order to restore it as a copy.

    RestoreGUIIssueCap1

    There’s a note at the top of the dialog that, while accurate, may not help some in this scenario. It says, “A tail-log backup of the source database will be taken.” Sounds innocuous if you don’t know what that implies. In past versions, restoring a copy of a database would never affect the original – or not by default, anyway – so I cringe as I imagine people merrily clicking past this warning.

    The script you get with these settings is indeed composed with a tail-log backup of the source database:

    RestoreGUIIssueCap2

    That will, as indicated, take a tail-log backup – and thereby take the original database down. Nice if you wanted that, but a mean surprise if you didn’t.

    If you act on the warning and click the Options tab, and then uncheck the offending setting, you do get the expected behavior (the original database is unaffected because the tail-log backup is not included at the top of the script):

    RestoreGUIIssueCap3
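
    And if you would rather sidestep the question entirely, a plain scripted restore of a copy never needs to touch the source database at all – something like this, where the logical file names and paths are placeholders:

    -- Restore a copy alongside the original; the source database is not touched.
    RESTORE DATABASE MyProdDB_Copy
        FROM DISK = N'X:\Backup\MyProdDB_full.bak'
        WITH MOVE N'MyProdDB'     TO N'X:\Data\MyProdDB_Copy.mdf',
             MOVE N'MyProdDB_log' TO N'X:\Data\MyProdDB_Copy_log.ldf',
             RECOVERY, STATS = 5 ;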

    So, be careful out there!

  • T-SQL Tuesday #040: Files, Filegroups and Visualizing Interleaved Objects

    Early in my career as a DBA, I have to admit I didn’t quite “get” what all the fuss was about with the multiple file and multiple filegroup capability in SQL Server. Over the years, though, as I learned more about partitioning, backup strategies for large databases and, most importantly, storage internals I’ve started to catch on. For today’s T-SQL Tuesday, I thought I would share an early lesson, using the newly released SQL File Layout Viewer utility available here.

    tsql2sday

    Adam Machanic started T-SQL Tuesday several years ago and it’s been a great way to motivate bloggers and readers to share ideas around a topic. This month’s event is hosted by Jen McCown / MidnightDBA. An awesome idea, as ever.

    What I hope to illustrate today is a simple but vital concept about files and file groups: files, on their own, use a proportional fill algorithm in SQL Server and will receive new data from any object that needs more space. If an object needs additional storage, an allocation will be made in an empty area of whichever file has the most space available. Under the multi-file feature alone, there is no attempt whatsoever to separate objects in the database logically into different files. The result is that objects can appear to “round robin” through the files as more space is allocated – it’s not really a round-robin algorithm, but if the files are all about equally full, it looks similar.

    Filegroups, on the other hand, allow logically grouping objects in a way that specifically directs them into separate files, and so controls the physical layout of those objects on disk. Using filegroups it’s possible to direct an object into a file (or set of files) and physically separate it from the rest of the database.

    Because of this distinction, it’s been my experience that filegroups are far and away the more useful of the two features, and if I have a database that merits splitting up, I go straight for filegroups, with one physical file per group. (There are some edge case exceptions like TempDB allocation contention that are outside the scope of this post.)

    Interleaving

    There are a few reasons we’d want to physically separate objects, but, because it so clearly illustrates the difference, I want to show interleaving today. I made a demo database using the following script:

    USE [master]
    GO
    
    :setvar datapath "C:\Program Files\Microsoft SQL Server\MSSQL10_50.SQL2008R2\MSSQL\DATA"
    :setvar logpath "C:\Program Files\Microsoft SQL Server\MSSQL10_50.SQL2008R2\MSSQL\DATA"
    
    CREATE DATABASE [VizDemo3] ON  PRIMARY 
    ( 
        NAME = N'VizDemo3', FILENAME = N'$(datapath)\VizDemo3.mdf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    FILEGROUP [MultiFileFG] 
    ( 
        NAME = N'VizDemoMFGF1', FILENAME = N'$(datapath)\VizDemoMFGF1.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    ( 
        NAME = N'VizDemoMFGF2', FILENAME = N'$(datapath)\VizDemoMFGF2.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    ( 
        NAME = N'VizDemoMFGF3', FILENAME = N'$(datapath)\VizDemoMFGF3.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    FILEGROUP [SingleFileFG1] 
    ( 
        NAME = N'VizDemoSFG1', FILENAME = N'$(datapath)\VizDemoSFG1.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    ), 
    FILEGROUP [SingleFileFG2] 
    ( 
        NAME = N'VizDemoSFG2', FILENAME = N'$(datapath)\VizDemoSFG2.ndf' , 
        SIZE = 25600KB , MAXSIZE = UNLIMITED, FILEGROWTH = 51200KB 
    )
    LOG ON 
    ( 
        NAME = N'VizDemo3_log', FILENAME = N'$(logpath)\VizDemo3_log.ldf' , 
        SIZE = 10240KB , MAXSIZE = 2048GB , FILEGROWTH = 10240KB 
    )
    GO
    
    ALTER DATABASE [VizDemo3] SET RECOVERY SIMPLE 
    GO
    
    USE VizDemo3
    GO
    
    CREATE TABLE dbo.SampleCustomers_on_Primary  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON [PRIMARY] ;
    GO
    
    CREATE TABLE dbo.SampleOrders_on_Primary  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON [PRIMARY] ;
    GO
    
    INSERT dbo.SampleCustomers_on_Primary DEFAULT VALUES;
    INSERT dbo.SampleOrders_on_Primary DEFAULT VALUES;
    GO 20000
    
    
    CREATE TABLE dbo.SampleCustomers_on_Files  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON MultiFileFG ;
    GO
    
    CREATE TABLE dbo.SampleOrders_on_Files  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON MultiFileFG ;
    GO
    
    INSERT dbo.SampleCustomers_on_Files DEFAULT VALUES;
    INSERT dbo.SampleOrders_on_Files DEFAULT VALUES;
    GO 20000
    
    
    CREATE TABLE dbo.SampleCustomers_on_FileGroups  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON SingleFileFG1 ;
    GO
    
    CREATE TABLE dbo.SampleOrders_on_FileGroups  ( 
        id int identity( 1, 1 ) NOT NULL PRIMARY KEY CLUSTERED, 
        buncha char(500) default 'A', 
        big char(500) default 'B', 
        vals char(500)  default 'C'
    ) ON SingleFileFG2 ;
    GO
    
    INSERT dbo.SampleCustomers_on_FileGroups DEFAULT VALUES;
    INSERT dbo.SampleOrders_on_FileGroups DEFAULT VALUES;
    GO 20000

    This is a really simple example showing several possible file layouts, and what those do with the physical storage of the data. Starting from the top, note that the database create statement makes three different file setups:

    1. Our old friend Primary – one MDF file to rule them all!
    2. A file group with three physical files: MultiFileFG. These files will be populated with SQL Server’s proportional fill.
    3. Two file groups with one physical file apiece: SingleFileFG1 and SingleFileFG2. This allows the DBA to direct objects into specific files on disk.

    Next I make some pairs of sample tables, and locate a pair in Primary, a pair in MultiFileFG, and a pair (one table each) in SingleFileFG1 and 2. The tables are each populated with 20,000 rows of sample data, arbitrarily called Customers and Orders.

    It’s important to note that the tables are filled with data in an alternating fashion – one table gets a row, then the other table gets a row – and that means that the allocations for space for those objects will also alternate. One table grows, then the other, then the first again.

    If the tables are in the same filegroup, even with separate physical files, the data will be interleaved. There will be a block of pages allocated for one table, then a block for the other table. Under default settings in SQL Server (not using the -E startup parameter, for example) the blocks of pages will each be a 64 KB extent of 8 data pages.

    So, let’s have a look at what happens inside the files, and see if it supports this logic.

    One File

    InterleaveScreenCap1

    My goodness it does! But you probably suspected that, as I’m the guy writing this post.

    First let’s look at the blue area – the stack-of-bricks looking light blue and purple region is the Primary filegroup / MDF file. What we are seeing is the alternating, interleaved extents from the two tables, one in blue and one in lavender/purple. This isn’t terrible, but there are two things worth noting:

    1. There’s a decent amount of fragmentation. Not harmful, necessarily, but these two tables definitely are not contiguous. If we have an application that wants to do a lot of scans, and we have spinning media, we might have some performance degradation. Read-ahead probably won’t work as well as it would against contiguous extents. This might not be ideal for a data warehouse.
    2. If we re-index one of those objects the other one will remain full of “holes.” We could re-index both, but it might take a few passes before this interleaving works itself out.

    I want to emphasize here that this isn’t necessarily a terrible situation, but it’s illuminating as far as how the data is physically written out.
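
    You can also cross-check the picture from T-SQL: sys.dm_db_index_physical_stats should report a high fragment count and a small average fragment size for the two interleaved tables, and far more contiguous numbers for the isolated ones shown later. For example, against the demo database (DETAILED mode reads every page, so keep this to test systems):

    SELECT OBJECT_NAME( ps.object_id, ps.database_id ) AS tableName,
           ps.fragment_count,
           ps.avg_fragment_size_in_pages,
           ps.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats( 
             DB_ID( N'VizDemo3' ), NULL, NULL, NULL, 'DETAILED' ) ps
    WHERE ps.index_level = 0          -- leaf level only
    ORDER BY tableName ;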

    Three Files in a Group

    The next reasonable question to ask is whether making multiple files, alone, might solve this issue. So the next section of the visualizer, scrolling down a bit, shows the one filegroup/multiple file scenario from the demo. This is filegroup MultiFileFG, shown in alternating pink and purple:

    InterleaveScreenCap2

    Notice, we got three files, but the two tables we map into them are still interleaved, now across all three files. (Each file in the visualizer has a signature of orange-gray-orange system pages right at the beginning.)

    This pattern is because of the proportional fill applied to the physical files inside one filegroup. An extent is allocated for an object from one file; then the next file has the most available space, so it is used for the next extent, and so on. This means our tables are still cut up into 8-page extents spread across the physical media.

    Two Groups with One File Each

    The last section of the demo writes each table into a separate filegroup, where each filegroup is backed by a single physical file. The results here are represented by the solid pink area in the graphic:

    InterleaveScreenCap3

    In this case, the alternating inserts in the script don’t result in alternating allocations inside the files, because one table is specifically isolated to one filegroup and the other table to the other. In this case, the data in the files is practically contiguous.

    Conclusion

    Does this mean you should go out and make 1,000 file groups, one for every table or index? Please don’t! There’s a balance when planning files and filegroups between manageability, the real (or imagined) performance impact of fragmentation, complexity, and many other factors. SQL Server actually does a fine job, on smaller databases, working with one or a handful of files. But if you are planning a larger warehouse project, or you find the hardware you have struggles with the volume of data you need to manage, or you are looking at partitioning, then understanding how these storage internals behave, along with many other design considerations, will help.

  • Presenting to East Iowa I-380 U.G. March 12

    Tomorrow I’m making the beautiful drive down from Minneapolis into Iowa to present to the I-380 / East Iowa SQL Server Users Group. I’ll be talking about managing lots and lots of unruly databases, and presenting a walkthrough of this maintenance solution. If you’re in the area please come check it out!

  • March Update to Rules-Driven Maintenance

    This month I have a minor update to the Rules-Driven Maintenance code I originally posted back in August 2012. This update has just two enhancements, but they are nice ones, I think:

    1. Much improved handling for instances that use database snapshots.
    2. Ability to do intra-day differential backups purely by setting maintenance times and limits in the policy table.

    The code posted here is cumulative, and replaces entirely the original code, but please refer back to that original blog post for description, instructions and details. This update can be installed right over an existing deployment, or be installed all by itself as a complete solution but – as always – please test and be cautious.

    Enjoy!

    Note: I've posted updates to this solution. Please get the latest version here.

  • Trigger Authoring Decision Tree

    TriggersDecisionTree
