SQL Server 2014 introduces updateable nonclustered and clustered columnstore indexes. When new data is inserted, it goes first to deltastore where background process divides it into row groups, indexes and compresses. Indexing progress can be observed via sys.column_store_row_groups DMV. According to BOL, row groups in deltastore can be in one of 3 possible statuses: OPEN (a read/write row group that is accepting new records. An open row group is still in rowstore format and has not been compressed to columnstore format), CLOSED (a row group that has been filled, but not yet compressed by the tuple mover process) and COMPRESSED (a row group that has filled and compressed). COMPRESSED actually means that data is already part of the index. Row groups that are still in OPEN or CLOSED state are yet to be added to the index and currently reside somewhere in deltastore (internal object inside the database). You won’t find it in sys.objects or sys.partitions. But it is there – you can observe it using “half-documented” sys.system_internals_partitions view.
Anyway, performance difference between fully updated columnstore index and situation when part of your data that is still in OPEN or CLOSED state in deltastore is huge. So we would like the background process that indexes and compresses data to be as fast as possible. Unfortunately it is painfully slow. I’ve loaded 45 billion rows into clustered columnstore index. When loading had finished, I had only 13 billion rows in the index – less than 30%. 12 hours passed since then – I continue all the time to observe indexing progress using following query:
3: SUM(total_rows) as TotalRows,
4: SUM(CONVERT(BIGINT, size_in_bytes))/(1024*1024) as SizeMB
5: FROM sys.column_store_row_groups
6: GROUP BY state_description;
Results are very disappointing – less than 1 billion rows in 12 hours. Meaning indexing takes much longer than loading itself. Actually, more than 10 times longer. And querying data which is still in the deltastore is painful even more. Here is estimated plan – couldn’t wait for an actual, stopped the query after 15 minutes (when indexed with fully updated columnstore index, runs under 2 seconds):
Look at the estimated IO and CPU costs – they’re huge. I guess, it means scanning entire deltastore. Which is reasonable – there is no traditional b-tree index on this data (for sure not while in deltastore), columnstore index isn’t ready yet, so scan is inevitable. And even more – I couldn’t find any query that would run fast. That makes sense too – even if required data resides fully inside columnstore index, we can’t be sure about that. What if there is another part that is still in deltastore? It is unsorted and un-indexed, remember? So for any query SQL Server should check deltastore – in case something there is not in “COMPRESSED” status. In my case it means scanning 45B – 14B = 31B rows.
What does it mean? It means that updateable clustered/nonclustered columnstore indexes should be used with extreme caution. If your data is constantly updated (loaded) or if you have huge single ETL but not enough time for a background process to calculate entire index before users start to query it – result can be extremely painful. Don’t test your system only when data is ready and queries run blazingly fast. Test it when deltastore contains some data. I hope, Microsoft would improve performance of the process that indexes and compresses data – in my environment it seems extremely lazy: nothing else runs, no user sessions at all and still CPU is on 1-2% and indexing progress according to sys.column_store_row_groups is very very slow.
P.S. As I’ve blogged earlier, sys.partitions sees only rows which are in “COMPRESSED” status. So sp_spaceused, for example, currently indicates that my table contains only 14B rows.
Update: by request from Kevin Boles in comments below, I’m providing environment metrics.
It wasn’t intended to be a stress test, just turned out that way. The original intention was to compare my client’s application performance in a current version where we store some data in long compressed (proprietary + zip) blobs vs same data in clustered columnstore index. Updateability is a must, so columnstore index in SQL 2012 wasn’t an option. Data loading had been executed by application that was parsing the blob and calling stored procedure that receives TVP as an input parameter. Every call contained ~5K rows. 3 instances of this “migration tool” were running concurrently. The important part here is that it wasn’t bulk load. Data loading took ~3 days, mainly because of CPU bottleneck in migration application – database could do faster. As I’ve mentioned above, by the end of the loading, when ~45B rows had been inserted into the table, SQL Server had indexes and compressed 13B rows.
Environment is Virtual Machine running Windows 2012 Server, Intel Xeon CPU E5-2670 (2.6GHz) with 4 virtual cores; 6GB of RAM; IO – don’t know what is below the disk that I see but anyway, there is no IO bottleneck.
SQL Server used ~5GB of memory. MAXDOP setting remained default (0). Delayed Transaction Durability was set to “forced”.
I took IO and wait types snapshots several times during the day. IO latencies were ~12ms/read and ~0.5 ms/write. CPU was between 1% and 5%. During 5 minutes between snapshots SQL Server wrote ~144MB and read ~240 MB. Not much. Looks like server is almost idle. Tempdb was idle either - ~1% of total IO activity.
There were only 2 noticeable wait types – between subsequent 5 minutes snapshots they both had 299.98 second wait time (i.e. entire interval): DIRTY_PAGE_POLL and HADR_FILESTREAM_IOMGR_IOCOMPLETION.
Average number of rows per rowstore is ~1.05M. Average row group size is ~9.3MB. All the data resides in a single partition – as it is single partition in the current system (although with long blobs it is only 10M rows instead of 45B).
Regarding dictionary usage, sys.column_store_dictionaries contains ~141K rows with entry_count in a range from 1 to 27.5K.
Regarding query: here it is (without table and column names):
1: SELECT AVG(Col5)
2: FROM dbo.Tbl
4: Col1 = @Int1 AND
5: Col2 = @Int2;
I’ve written aside some statistical data during data migration, so I know for sure that this WHERE clause filters 27.5M rows. Same query on a fully updated columnstore index (same number of rows to query but index itself much smaller) on the same machine took under 2 seconds with cold cache.
In my previous post about the subject I’ve complained that according to BOL, this feature is enabled for Hekaton only. Panagiotis Antonopoulos from Microsoft commented that actually BOL is wrong – delayed durability can be used with all sorts of transactions, not just In-Memory ones. There is a database-level setting for delayed durability: default value is “Disabled”, other two options are “Allowed” and “Forced”. We’ll switch between “Disabled” and “Forced” and measure IO generated by a simple workload.
Adding table (and re-creating it before every execution):
1: IF OBJECT_ID('DDTest') IS NOT NULL
2: DROP TABLE dbo.DDTest;
4: CREATE TABLE dbo.DDTest
6: id INT NOT NULL IDENTITY(1,1),
7: col1 INT NOT NULL DEFAULT 100,
8: col2 VARCHAR(100) NOT NULL DEFAULT 'abcd'
We would like to check impact of delayed durability setting on reads and writes to data and log files. So we’ll take snapshot before and after the test and compare. The test itself is very simple – loop that inserts single row with default values into the table – 10 thousand times. Expected results are: at least 10 thousand small log writes when delayed durability is disabled and… hopefully something interesting when it is set to “Forced”.
1: SELECT * INTO #Before FROM sys.dm_io_virtual_file_stats(DB_ID('DDTest'), NULL);
5: SET NOCOUNT ON;
7: DECLARE @i INT =1;
9: WHILE @i <= 10000
11: INSERT INTO dbo.DDTest( col1, col2 )
12: VALUES(DEFAULT, DEFAULT);
14: SET @i += 1;
19: SELECT * INTO #After FROM sys.dm_io_virtual_file_stats(DB_ID('DDTest'), NULL);
Following query calculates IO between snapshots:
1: WITH Diff AS
4: b.database_id, b.[file_id],
5: a.num_of_reads - b.num_of_reads as num_of_reads,
6: a.num_of_bytes_read - b.num_of_bytes_read as num_of_bytes_read,
7: a.num_of_writes - b.num_of_writes as num_of_writes,
8: a.num_of_bytes_written - b.num_of_bytes_written as num_of_bytes_written
10: #Before as b
11: INNER JOIN #After a ON b.database_id = a.database_id AND b.[file_id] = a.[file_id]
13: DBIO as
16: DB_NAME(d.database_id) as DBName,
17: CASE WHEN mf.[type] = 1 THEN 'log' ELSE 'data' END as file_type,
18: mf.[name] as [file_name],
19: SUM(d.num_of_bytes_written) as [io_bytes_written],
20: SUM(d.num_of_writes) as [writes],
21: SUM(d.num_of_reads) as [reads],
22: SUM(d.num_of_writes + d.num_of_reads) as [io]
24: Diff d
25: INNER JOIN sys.master_files mf ON d.database_id = mf.database_id AND d.[file_id] = mf.[file_id]
26: GROUP BY
28: CASE WHEN mf.[type] = 1 THEN 'log' ELSE 'data' END,
32: DBName, file_type, [file_name], [io], [reads], [writes],
33: CAST(1.0 * [io_bytes_written]/(1024*1024) as DECIMAL(12, 2)) as io_mb_written,
34: CAST(1.0 * [io_bytes_written]/[writes] as DECIMAL(12, 2)) as io_bytes_per_write
35: FROM DBIO
36: ORDER BY file_type;
Results vary from test to test but just a few bytes give or take. Generally they are very consistent. And positively surprising:
||Avg Bytes per Write
Tried to force checkpoint and measure IO after that – no change, seems like black magic. I expected to see less IO with more bytes per write but generally same volume of writes both to data file and log. But results clearly show that we write less – both to data file and to log. And ~10K average IO size in data file drives me crazy – shouldn’t it be 8K???
Also very important observation although expected one, especially on my laptop with its old hard disk: transaction with delayed durability is _much_ faster. 10K writes, even sequential ones, can make huge difference.
Results are slightly different when switching between Simple and Full recovery models but the trend stays. I guess, it can have something to do with minimal logging. Would be interesting to see whether modifying recovery interval affects delayed durability behavior. And also 2 more tests: stress (many sessions doing insert concurrently) and workload with different IO pattern: would it be that effective on bulk loading, for example. And for sure digging in transaction log using fn_dblog should bring something interesting for internals’ geeks among us.
As you probably know, this DMV is new in SQL Server 2014. It had been first announced in CTP1 but only in BOL. Now in CTP2 everyone can “play” with it. Since BOL is a little bit unclear (understatement detected), I’ve prepared this small FAQ as a result of discussion with Adam Machanic (blog | twitter) and Matan Yungman (blog | twitter).
Q: What did you expect from sys.dm_exec_query_profiles?
A: Expectations were very high – it promised, for the first time, ability to see _actual_ execution plan details and query execution progress in Production systems in real time. For example, to see difference between actual and estimated number of rows per iterator – one of the most important indicators of plan efficiency (or more important – inefficiency).
Q: I executed long and heavy query. And didn’t find it in sys.dm_exec_query_profiles. Actually, DMV was empty. What happened?
A: BOL says: Monitors real time query progress while the query is in execution. The counters are per operator per thread. The data collected is serialized into the SHOWPLAN XML when the query finishes. What Microsoft really wanted to say by it was that SHOWPLAN XML or any other sort of actual execution plan is a prerequisite. If actual execution plan isn’t required for your query, SQL Server won’t gather all this data at the first hand. And won’t show it in sys.dm_exec_query_profiles during runtime.
Q: How can I still use this DMV?
A: You can use it via Management Studio: ask for actual execution plan for your query – and it’ll appear in sys.dm_exec_query_profiles. That’s good for demos or for investigating particular heavy query because at least in CTP2 sys.dm_exec_query_profiles shows more data than SHOWPLAN XML. I hope, Microsoft would close this gap - if data is collected and presented in DMV anyway, there is no reason whatsoever to hide it in actual plan. If you want to be able to monitor progress and measure plan efficiency of all queries running on your server in real time, you can use plan event in Profiler or – the most simple and recommended way in my opinion – just create Extended Events trace with query_post_execution_showplan event and DMV would come to life. If you’re not interested in plan itself but only in DMV, use ring buffer as ax target and small buffer size – this way you won’t waste too much memory and at the same time you’ll spare IO required for writing plans to disk.
Q: So I can just create trace and leave it running for good?
A: Not recommended unless your server isn’t really busy. Event description for query_post_execution_showplan event says: “Occurs after a SQL statement is executed. This event returns an XML representation of the actual query plan. Using this event can have a significant performance overhead so it should only be used when troubleshooting or monitoring specific problems for brief periods of time”. Gathering all this actual execution-related data is resource-consuming process by itself, so it can effectively kill busy server with thousands transactions per second. If you’re interested in exact numbers of throughput degradation as a result of actual plan capturing, excellent benchmark is here (refers to SQL Server 2012 but, I believe, SQL 2014 won’t bring significant enhancements in this field).
Q: Now you’ve frightened me. So your recommendation is not to use in Production environment?
A: Didn’t say that. As Adam pointed, we probably won’t use it in busy OLTP databases but DW-like environments with small number of heavy and complicated transactions seem to be a good candidate for monitoring and troubleshooting using sys.dm_exec_query_profiles. For this matter it could be nice to have trace flag or flag in sys.configurations (i.e. managed by sp_configure). Trace flag could be especially effective because it enables high granularity – only for desired queries – using QUERYTRACEON hint.
Q: But what is so interesting in this DMV?
A: A lot. It presents data per iterator in execution plan and provides 2 distinct categories of counters: query execution related and storage engine related. Examples for the former: row_count, estimate_row_count; for the latter: physical_read_count, write_page_count etc. Queries based on sys.dm_exec_query_profiles deserve separate post(s) – that’s not primary topic of this FAQ article.
Remember the excitement about new sys.dm_exec_query_profiles DMV? It promised to be a gamechanger, providing query visibility at a runtime and easily extractable information about heavy iterators in execution plan. So it has been announced but missing. Now in CTP2 it is finally here. So, singing one of my favorite Queen songs “… It finally happened - I’m slightly mad…”, I tried to observe query execution data at a runtime. And… nothing. Query is running, DMV is empty. That’s really disappointing unless I’m missing something of course. BOL doesn’t specify any configuration parameter that turns it on; so, I guess, it goes step by step: in CTP1 we could only see it (in BOL), in CTP2 we can smell – hopefully in CTP3 it would finally happen. Otherwise I will be not so slightly mad.
Caught something. But this time it is even more funny – while I had 2 concurrent sessions executing queries, all that I’ve caught was my own query – from sys.dm_exec_query_profiles!
Looks like query appears in the DMV only when actual execution plan is requested. So if you execute query from Management Studio and ask for an actual plan – you’ll be able to trace query progress in sys.dm_exec_query_profiles. Otherwise wait for next CTP or RC or RTM – whatever. Thanks to Matan Yungman (see his comment below) for an idea.
As I’m downloading SQL Server 2014 CTP2 at this very moment, I’ve noticed new fascinating feature that hadn’t been announced in CTP1: delayed transaction durability. It means that if your system is heavy on writes and on another hand you can tolerate data loss on some rare occasions – you can consider declaring transaction as DELAYED_DURABILITY = ON. In this case transaction would be committed when log is written to some buffer in memory – not to disk as usual. This way transactions can become much faster – especially when log IO is the bottleneck. Delayed durability can smoothen write peaks. Looking forward to use it. But… unfortunately, this feature works for In-Memory OLTP only! Which greatly reduces its effectiveness. Prior to noticing this limitation, I’ve already imagined how I can use it on one of my heaviest procedures that generates tens and sometimes hundreds megabytes of log every time it runs. And no, it isn’t OLTP procedure, it works with BLOBs, so Hekaton isn’t an option. And I can always reload this data if something fails. So delayed durability could potentially reduce procedure duration by 30-40%. Potentially – in case Microsoft would some day implement it for really heavy writers and not only for In-Memory OLTP.
I guess, primary use of this feature is in extremely loaded environments with IO bottleneck on transaction log – in this case even millisecond or several microseconds reduced from transaction duration can prevent collision. And possible failure because as you probably already know, Hekaton treats conflicts in different way: instead of blocking, transaction can just fail on inconsistency.
Columnstore index in SQL Server 2014 contains 2 new important features: it can be clustered and it is updateable. So I decided to play with both. As a “control group” I’ve taken my old columnstore index demo from one of the ISUG (Israeli SQL Server Usergroup) sessions. The script itself isn’t important – it creates partition function with 7 partitions (actually 8 but one remains empty), table on it and populates the table with 63 million rows – 9 million in each partition. So I used the same script for SQL 2014 but this time created clustered columnstore index on empty table. So population should have updated the index. I had 2 issues to check:
2) According to the following article, columnstore index is impossible to update “in-place” due to its highly compressed structure. So “deltastores” are used. What I wanted to check is whether metadata accuracy is affected.
Regarding sizing, as expected, columnstore index provides better compression than anything else. Table in SQL 2012 contained clustered index that used page compression and non-clustered columnstore index. Clustered index size was 3.8GB, non-clustered columnstore index “weighted” 3.04GB. Clustered columnstore index in SQL 2014 “weighted” 2.87GB only. So instead of 6.84GB in SQL 2012, in SQL 2014 we’ve used only 2.87GB of diskspace.
Now to the metadata part. While script for table population had been running in SQL 2012, both sp_spaceused and rows column in sys.partitions showed pretty accurate picture. Not so with SQL 2014 when clustered columnstore index is being updated. Here is the output of sp_spaceused and from sys.partitions after population script finished to run and inserted 63 million rows into a table (7 partition, 9 million rows in each one):
Not even close, especially sp_spaceused. By the way, you can notice that I even tried to use second parameter of sp_spaceused: @updateusage. Nothing happened. As anticipated, what fixed the problem was index rebuild. That’s what we indeed have in the table:
So beware that the deltastore algorithm that is used for updating columnstore index leaves metadata very inaccurate, especially sp_spaceused. So don’t rely on it unless Microsoft finds a way to fix it.
Next question, much more important for SQL Server proper functioning is whether statistics also enter such a weird condition after massive columnstore index update. If so that’s not a “no go”, it just means that manual update of statistics is required after every massive columnstore index update.
Downloaded and installed new toy:
And while reading BOL, stumbled upon new extremely useful DMV: sys.dm_exec_query_profiles. This DMV enables DBA to monitor query progress while it is being executed. Counters in the DMV are per operation per thread. So we’ll be able to monitor in real time which thread (even for parallel processing) processes which node in the plan. Or find heavy operations “post mortem”. We all know the uncomfortable feeling when some heavy query runs and the boss starts asking “stupid” questions like “when would it finish” or “what percent of execution is already beyond us”. We don’t have answer for that. New DMV provides at least partial answer as first time it enables to drill down into query execution progress. That’s not the “percent_completed” column that we monitor on DBCC CHECKDB or BACKUP operations. But as close as it can be.
See you next CTP, I guess.
A week ago I gave my session on the subject. It is only first part of the two – second will be next time when there is an open slot. Demos from my session are available here.
Recently I had to take care of the most interesting corruption case I’ve even seen, so decided to share this experience with you. We’re talking about small accounting program which keeps its data in SQL Server Express – in this particular case in SQL Server 2005. The customer called today and sent me following error screen (nice screenshot – taken with cellular phone camera ):
Upon connecting to the server I’ve immediately noticed dumps with the same error. Here is entire error message:
A time-out occurred while waiting for buffer latch -- type 2, bp 04268450, page 1:804, stat 0xc00009, database id: 5, allocation unit Id: 72057594108248064, task 0x00A186B8 : 0, waittime 300, flags 0x1a, owning task 0x00A0A4D8. Not continuing to wait.
This article was extremely helpful in analyzing this dump with WinDbg tool – flow and somewhat cryptic commands described there easily pinpointed the guilty thread and its call stack:
Highlighted part of the call stack indicates that Ghost Cleanup process caused this failure. Actually, I had a similar encounter with stuck ghost cleanup several years ago. The easiest way to verify that it is still stuck is to execute DBCC CHECKDB command. Indeed CHECKDB appeared to be blocked by ghost cleanup session.
What’s next? You can’t kill system session. Indeed you can’t by you can start SQL Server without it – using trace flag 661 as startup parameter (don’t forget to remove flag and restart service again if you use this flag). So, after restarting service, rebuilding index in question, removing trace flag and restarting service again, I’ve already thought that
I’ve earned my beer database is fixed.
Not so fast. Now DBCC CHECKDB succeeds to complete but results are very very red. Including interesting messages like:
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=1) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=2) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,column_id=3) in sys.columns does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1575689407) of row (object_id=1575689407,index_id=0) in sys.indexes does not have a matching row (object_id=1575689407) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3855, State 1: Attribute (data_space_id=1) exists without a row (object_id=1575689407,index_id=0) in sys.indexes.
Nice, uh? After first wave of shock has passed, I checked and found out that indeed object with that id doesn’t exist. So it seems that all we need to do is to delete 3 rows from sys.columns and 1 row from sys.indexes. Ah, but those sys.something objects are views, aren’t they? And what are the real objects? The way to find real – internal – tables and columns is via execution plans:
So actually we need to delete rows from sys.syscolpars and from sys.sysidxstats (notice that column names are also different). How do we do it? Let’s try DAC (Dedicated Admin Connection)? No way – Express Edition doesn’t support DAC. Unless… unless we use trace flag 7806 as startup parameter.
Restart server again, connect using DAC, try to delete rows… Oops,
Msg 259, Level 16, State 1, Line 1
Ad hoc updates to system catalogs are not allowed.
But for this we have Paul Randal’s instructions. So: “ sqlservr.exe -sInstanceName -m -T661 –T7806 “, then “ sqlcmd -S.\InstanceName /A ” and finally…
1: DELETE FROM sys.syscolpars WHERE id = 1575689407 AND number = 0
3: DELETE FROM sys.sysidxstats WHERE id = 1575689407 AND indid = 0
Now stop the server, remove all trace flags, start server in a normal way and verify that DBCC CHECKDB returns nothing. Bingo! And… well deserved my own home-brewed Scottish Ale!
Recently sp_blitz procedure on one of my OLTP servers returned alarming notification about high latency on one of the disks (more than 100ms per IO). Our chief storage guy didn’t understand what I was talking about – according to his measures, average latency is only about 15ms. In order to investigate the issue, I’ve recorded 2 snapshots of sys.dm_io_virtual_file_stats and calculated latency per read and write separately. Results appeared to be even more alarming: while for read average latency was only 9ms, for write it skyrocketed to 260ms. Even more strange – PerfMon results for the same time interval showed similar read latency but only 17ms per write! We’ve almost opened support request for it – during informal phone conversations with colleagues in Microsoft nobody could explain it.
Finally we’ve found the answer after more thorough analysis of PerfMon file. It appears that calculation formula is different. sys.dm_io_virtual_file_stats contains total number of writes and total write latency (stall), so by dividing them we receive average latency per single write operation. PerfMon on another hand measure Disk sec/write counter every sample interval (default is 15 seconds, I used 1 second interval in order to see more detailed picture). And according to PerfMon, average latency per write during certain period is average of samples measured during this period. Disk we’re talking about contains only datafiles of a single database. So we’ve noticed that while number of read operations per second is pretty equal along entire interval, writes behave totally different. Most of the time we see 0 (zero) writes – and 0 write latency accordingly. But once every minute we see massive writes – SQL tries to write almost 4000 IOs per second for several seconds in a row (several means sometimes 2, sometimes 10). I guess, it is checkpoint. And during these peaks latency jumps to huge values – up to 1.8 seconds (!) per write. The catch is that when we have 59 samples with 0 latency (we don’t have writes, remember?) and 1 second with 120ms latency, PerfMon with calculate average and receive (59*0 + 1*120)/60 = 2ms per write latency. sys.dm_io_virtual_file_stats at the same time will report 120 ms / write. And indeed, when I calculated average write latency from PerfMon but took into account only samples with non-zero number of writes, results became very close to what sys.dm_io_virtual_file_stats showed.
So who is right in this story and should we do something with the disk? Both are right – they just calculate different numbers. sys.dm_io_virtual_file_stats calculates average latency per IO but doesn’t provide any indication whether load is linear or contains peaks and idle periods. Peaks have much more influence – because peaks in latency are usually accompanied by peaks in number of IO operations, so their relative weight is higher. Perfmon on another hand, when it comes to calculate average over period (for example, PAL tool does it automatically), smoothens peaks because for it every sample – the one with zero writes and the one with 10K writes per second – has same relative weight.
Here is how our latency graph looks:
My first thought was: “Checkpoint is background process, shouldn’t really affect user experience – storage guy is right, leave the disk alone”. But then I decided to check read latencies – indeed, when it is so busy that write takes 1.8 seconds, there is no reason to suppose that reads fill any better. As I said in the beginning, overall average read latency was 9ms. But when separated for write-idle and checkpoint periods, results were completely different. For write-idle periods, read latency was 1.5ms. But during checkpoint it jumped to 120ms. So user experience is affected – read queries that go to disk can run very slow or even appear as stuck during checkpoints. Beware .
Recently I had to analyze huge trace – 46GB of trc files. Looping over files I loaded them into trace table using
fn_trace_gettable function and filters I could use in order to filter out irrelevant data. I ended up with 6.5 million rows table, total of 7.4GB in size. It contained RowNum column which was defined as identity, primary key, clustered. One of the first things I detected was that although time difference between first and last events in the trace was 10 hours, total duration of all sql events was only 4 hours. So the next assumption to check was whether application got stuck - whether we have significant gap between two subsequent events. In the following query I’m looking for events that occurred at least 30 seconds after previous event’s completion.
SQL 2005/2008/2008R2 syntax:
curEvent.RowNum, prevEvent.EndTime AS PrevEventEndTime, curEvent.StartTime AS CurEventStartTime,
DATEDIFF(SECOND, prevEvent.EndTime, curEvent.StartTime) AS ApplicationStuckDuration
INNER JOIN dbo.TraceData prevEvent ON curEvent.RowNum - 1 = prevEvent.RowNum
WHERE DATEDIFF(SECOND, prevEvent.EndTime, curEvent.StartTime) >= 30
SQL 2012 syntax:
;WITH GapsCTE AS
SELECT RowNum, StartTime AS CurEventStartTime, LAG(EndTime, 1, NULL) OVER(ORDER BY RowNum ASC) AS PrevEventEndTime
SELECT RowNum, PrevEventEndTime, CurEventStartTime, DATEDIFF(SECOND, PrevEventEndTime, CurEventStartTime) AS ApplicationStuckDuration
WHERE DATEDIFF(SECOND, PrevEventEndTime, CurEventStartTime) >= 30
Here are results (pretty surprising at the first glance, I have to say):
| ||SQL 2012 ||SQL 2005/2008/2008R2 |
|Duration ||103 seconds ||77 seconds |
|Reads ||919,752 ||1,839,504 |
|CPU ||20 seconds ||13 seconds |
In the old SQL versions we see twice reads – that’s obvious from execution plan since we scan entire table twice and merge join on RowNum column. It seems that in this case Merge Join is so efficient that CPU overheard of segmentation, aggregation and spooling costs more than additional single ordered scan of the table that is already in memory anyway.
I started to write all this in order to demonstrate practical usage of the new function and how we can benefit from it compared to previous SQL Server versions. Results became a complete surprise for me.
Conclusion: always check and recheck yourself!
It happens to me at least once a week – I want to check progress of some heavy script that runs in chunks over big dataset and find out that it writes intermediate data to temporary table only. Last time it happened 3 days ago when I wanted to analyze 50GB trace table on my notebook. I wrote a script that was taking 200 thousand rows at a time, parameterizing them and aggregating by different keys – host name, application etc. Usual trace analysis stuff. After an hour I wanted to check the progress but found out that intermediate results are written to temp table and of course I forgot to add debug prints. Took me some thought and ~5 minutes of coding to find the solution. How?
My trace analysis script ran over 200K rows chunks and among other things aggregated them grouping by parameterized query text. One of the measurement columns was “TotalQueries” – counter of rows in a group that could be later used to calculate average values for Reads, Writes, Duration and CPU. Every 200k rows chunk inserted rows into #BA temp table. Last step in a script, when all chunks have already been processed, was aggregation of #BA table’s data into permanent table. So in order to monitor the progress I could SUM all values of “TotalQueries” column in #BA table and compare it to number of rows in my 50GB trace table (which can be easily verified using
Of course, I new the temp table name – I wrote the initial script. So first thing was to find its object_id.
SELECT [object_id] FROM tempdb.sys.tables WHERE name LIKE '#BA%'
It brought me negative number: –1546597904 in my current demo. I don’t have access to temp table created by another session. But as admin on my own notebook I sure have access to every page in every database. So next step is to find all pages that belong to #BA table. For this task we have undocumented but widely known (and used) DBCC IND command. Since digging in every page manually and wasting hour on it wasn’t exactly my purpose, I kept DBCC IND output in the table variable and used it later in the script.
DECLARE @BATablePages TABLE
PageFID INT, PagePID INT, IAMFID INT, IAMPID INT, ObjectID BIGINT, IndexID INT, PartitionNumber INT,
PartitionID BIGINT, iam_chain_type VARCHAR(64), PageType TINYINT, IndexLevel TINYINT, NextPageFID INT,
NextPagePID INT, PrevPageFID INT, PrevPagePID INT
INSERT INTO @BATablePages
EXEC('DBCC IND(2, -1546597904, 1)')
Actually we need only PagePID column and only for leaf level data pages e.g. PageType = 1. Next step is to loop over those pages and insert their content into another table. For viewing page’s data I used another widely known (and undocumented as well) DBCC PAGE command. It is less known that DBCC PAGE can be used along with WITH TABLERESULT suffix. Note: column definitions in the following script are accidental – I just wanted them to be wide enough so that script can complete successfully.
DECLARE @rc INT, @v_PageID INT, @sql NVARCHAR(MAX)
CREATE TABLE #ind( RowNum INT NOT NULL IDENTITY(1,1), PageNum INT NOT NULL )
CREATE TABLE #page(ParentObject VARCHAR(128), [Object] VARCHAR(512), Field VARCHAR(128), [Value] VARCHAR(8000) )
INSERT INTO #ind( PageNum )
WHERE PageType = 1
SET @rc = @@ROWCOUNT
WHILE @rc > 0
SELECT @v_PageID = PageNum FROM #ind WHERE RowNum = @rc
SET @sql = 'DBCC PAGE(2, 1, ' + CONVERT(VARCHAR(32), @v_PageID) + ', 3) WITH TABLERESULTS'
INSERT INTO #page(ParentObject, [Object], Field, [Value])
SET @rc = @rc - 1
So now we have all data pages inside single table but not in a format we’re used to:
But actually, we now have everything we need. The rest is simple:
DELETE FROM #page WHERE Field <> 'TotalQueries'
SELECT SUM(CONVERT(INT, [Value])) AS TotalQueries
2 years ago I wrote about monitoring page splits with Extended Events. Only 2 bloggers explored Extended Events feature at that time, so my post was more of a learning than for any practical matter. Yet one question remained open: according to straightforward calculations, after 3 splits of the first page, next split should have occurred on another page - but trace with extended events clearly showed 4 splits of initial page.
Recently Pavel Nefyodov has drawn my attention to another curious fact: initially we had 1 page; after all those splits we had 10. But trace showed 10 splits, not 9. So where did additional split come from? And finally I've realized that there is another simple way to trace those splits - via transaction log. If we manually checkpoint the system before and perform insert in transaction afer - our transaction log will containg exactly the relevant data; and it'll remain this way all the time transaction stays open. We'll query transaction log using undocumented but widely known fn_dblog function.
OK, Let's see some code (mostly taken from my old post which in turn used page split demo from Kalen Delaney's blog). Initial configuration (including trace definition):
CREATE TABLE split_page
(id INT IDENTITY(0,2) PRIMARY KEY,
id2 bigint DEFAULT 0,
data1 VARCHAR(33) NULL,
data2 VARCHAR(8000) NULL);
INSERT INTO split_page DEFAULT VALUES;
CREATE EVENT SESSION MonitorPageSplits ON SERVER
ADD EVENT sqlserver.page_split
ACTION (sqlserver.database_id, sqlserver.sql_text)
WHERE sqlserver.database_id = 2
ADD TARGET package0.asynchronous_file_target
filename = N'c:\temp\MonitorPageSplits.etx',
metadatafile = N'c:\temp\MonitorPageSplits.mta'
Short glance at our table's allocation inside data file:
DBCC IND(tempdb, split_page, -1);
|PagePID ||IndexID ||iam_chain_type ||PageType ||IndexLevel |
|175 ||1 ||In-row data ||10 ||NULL |
|148 ||1 ||In-row data ||1 ||0 |
So we have 1 IAM page and 1 data page. Now the split itself and - remember, we should query transaction log before committing transaction if we don't want accidental checkpoint to ruine all our efforts.
ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = start;
SET IDENTITY_INSERT split_page ON;
INSERT INTO split_page (id, id2, data1, data2)
SELECT 111, 0, REPLICATE('a', 33), REPLICATE('b', 8000);
SET IDENTITY_INSERT split_page OFF;
SELECT [Current LSN], Operation, Context, [Page ID], [Description]
FROM fn_dblog(NULL, NULL)
ALTER EVENT SESSION MonitorPageSplits ON SERVER STATE = STOP;
DROP EVENT SESSION MonitorPageSplits ON SERVER;
split.value('(/event/data[@name=''file_id'']/value)','int') AS [file_id],
split.value('(/event/data[@name=''page_id'']/value)','int') AS [page_id],
split.value('(/event[@name=''page_split'']/@timestamp)','datetime') AS [event_time],
split.value('(/event/action[@name=''sql_text'']/value)','varchar(max)') AS [sql_text]
SELECT CAST(event_data AS XML) AS split
FROM sys.fn_xe_file_target_read_file('c:\temp\MonitorPageSplits*.etx', 'c:\temp\MonitorPageSplits*.mta', NULL, NULL)
) AS t
ORDER BY [event_time]
First let's take a look at our table now:
|PagePID ||IndexID ||iam_chain_type ||PageType ||IndexLevel |
|175 ||1 ||In-row data ||10 ||NULL |
|148 ||1 ||In-row data ||1 ||0 |
|184 ||1 ||In-row data ||2 ||1 |
|185 ||1 ||In-row data ||1 ||0 |
|187 ||1 ||In-row data ||1 ||0 |
|188 ||1 ||In-row data ||1 ||0 |
|189 ||1 ||In-row data ||1 ||0 |
|190 ||1 ||In-row data ||1 ||0 |
|191 ||1 ||In-row data ||1 ||0 |
|392 ||1 ||In-row data ||1 ||0 |
|393 ||1 ||In-row data ||1 ||0 |
|394 ||1 ||In-row data ||1 ||0 |
Transaction log contains more rows than we are interested in, so I cleaned irrelevant ones out from the table below and correlated transaction log with extended events trace. I also transformed hexadecimal page numbers from transaction log to decimal ones - like we see in extended events trace.
So we can see that indeed there have been 3 splits of the first page. From transaction log we see that first action was actually allocating of page 184 which is - look at the allocation table above - index page. According to transaction log this page is allocated - not splitted from any other page. Still extended events trace shows as if it has been splitted from the first data page of our table. Don't know whether this behavior can be considered as bug - probably not. Anyway - both unanswered question from my old post are resolved: first page indeed has been splitted 3 times and we had only 9 data page splits while 10th split belongs to index page allocation.
Have a nice week!
From BOL, "Key-Range Locking" article:
Key-range locks protect a range of rows implicitly included in a record set being read by a Transact-SQL statement while using the serializable transaction isolation level. The serializable isolation level requires that any query executed during a transaction must obtain the same set of rows every time it is executed during the transaction. A key range lock protects this requirement by preventing other transactions from inserting new rows whose keys would fall in the range of keys read by the serializable transaction. (bold is mine - MZ)
I had a blocking locks problems recently. What should have been working concurrently from multiple threads actually worked single-threaded only. Here is demo - table of employees and indexed view above it that shows number of employees and total salary per department. I populated it randomly with up to 10 deparments.
SET ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, CONCAT_NULL_YIELDS_NULL, QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF
CREATE TABLE dbo.Employees
EmployeeID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
DepartmentID SMALLINT NOT NULL,
Salary MONEY NOT NULL
CREATE VIEW dbo.V_Departments
SUM(Salary) AS DepartmentSalary,
COUNT_BIG(*) AS TotalEmployees
GROUP BY DepartmentID
CREATE UNIQUE CLUSTERED INDEX PK_V_Departments ON dbo.V_Departments( DepartmentID ASC )
INSERT INTO dbo.Employees( DepartmentID, Salary )
[object_id] % 10 + 1 AS DepartmentID,
[object_id] % 3000 + 1 AS Salary
Now I open 2 concurrent transactions and try to add new employee from both. The important thing here is that both belong to new departments: 11 and 13 (in the demo data we had departments from 1 to 10).
|Session 1 ||Session 2 |
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
INSERT INTO dbo.Employees( DepartmentID, Salary )
VALUES( 11, 2500 )
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
INSERT INTO dbo.Employees( DepartmentID, Salary )
VALUES( 13, 3500 )
Second transaction is locked. Why? We don't have anything else that updates department 12, so why it is locked? Let's find out
OBJECT_NAME(i.[object_id]) AS ObjectName,
i.name AS IndexName,
l.request_mode AS LockRequestMode,
l.resource_type AS LockResourceType,
l.resource_description AS LockResourceDescription
sys.dm_tran_locks AS l
INNER JOIN sys.dm_os_waiting_tasks AS wt ON l.lock_owner_address = wt.resource_address
INNER JOIN sys.partitions p ON p.hobt_id = l.resource_associated_entity_id
INNER JOIN sys.indexes AS i ON i.[object_id] = p.[object_id] AND i.index_id = p.index_id
WHERE wt.session_id = 57
AND l.request_status = 'WAIT'
RangeS-U mode means shared lock for the range between values and update lock for values themselves. One value is obvious - department 11 that we've inserted in the first session. Second value, as we see in the query result, is (ffffffffffff) which indicates max int. Storage Engine knows to lock rows and other objects that are scanned by Query Processor. But we don't have row with department greater than 11. So right boundary of locked range is set to maximum possible value.
So we understand what range is locked. But why? If we work in pure read committed isolation level, we hold exclusive lock on rows we modify but nobody prevents another user from inserting or deleting another row with same department value. But we expect that our operation would update corresponding department row in the view before anybody else updates same row (e.g. if we read data after our transacion is complete, it won't be affected by other transactions). Which is classic serializable behavior. So instead of forcing customer to require serializable isolation level expicitely, SQL Server implicitely converted read-committed to serializable for this particular operation. It appears that same scenario occurs with cascading updates on foreign keys.
Almost every applicative DBA has scripts for locking and blocking issues detection. Some, like me, wrote the script themselves in order to become familiar with underlying DMVs, others downloaded one of the huge number of versions available in the net. All of these scripts are based on
sys.dm_tran_locks DMV which contains information about currently active locks and on
sys.dm_os_waiting_tasks that holds infomation about waiting tasks. Some of the scripts also filter out locks acquired by system sessions by adding "
WHERE request_session_id > 50". My own script was among them.
Recently I had a blocking issue. But for some reason my script didn't show blocking locks tree. Deeper investigation showed that both blocker and blocked session belonged to Service Broker. But why script failed to indicate it? Following query provides an answer - it shows spid of Service Broker activated tasks:
SELECT spid FROM sys.dm_broker_activated_tasks
As you can see, common belief that all spids under 50 belong to internal system sessions is just a myth. Actually, it was right at SQL 2000 days. Starting from SQL 2005 Serive Broker (and, I think, other features too) started to use first 50 spids.
Very nice ebook and script collection "Free SQL Server DMV Starter Pack" inspired me to write this post. It is really nice and "no nonsense" - I'll recommend it as a great starter to young DBAs that don't have their own script collection. Still script for detecting active locks contains the same old familiar mistake:
WHERE request_session_id > 50