SQLblog.com – The SQL Server Blog Spot on the Web

Michael Zilberstein

Beware of SQL Server and PerfMon differences in disk latency calculation

Recently, the sp_blitz procedure on one of my OLTP servers returned an alarming notification about high latency on one of the disks (more than 100ms per IO). Our chief storage guy didn’t understand what I was talking about – according to his measurements, average latency was only about 15ms. In order to investigate the issue, I recorded two snapshots of sys.dm_io_virtual_file_stats and calculated latency per read and per write separately. The results appeared to be even more alarming: while average read latency was only 9ms, write latency skyrocketed to 260ms. Stranger still, PerfMon results for the same time interval showed similar read latency but only 17ms per write! We almost opened a support request – during informal phone conversations with colleagues at Microsoft, nobody could explain it.

Finally, after a more thorough analysis of the PerfMon file, we found the answer: the calculation formulas are different. sys.dm_io_virtual_file_stats contains the total number of writes and the total write latency (stall), so by dividing the latter by the former we get the average latency per single write operation. PerfMon, on the other hand, measures the Disk sec/Write counter every sample interval (the default is 15 seconds; I used a 1-second interval in order to see a more detailed picture), and according to PerfMon, the average latency per write over a period is the average of the samples measured during that period.

The disk we’re talking about contains only the datafiles of a single database. We noticed that while the number of read operations per second is fairly even across the entire interval, writes behave totally differently. Most of the time we see 0 (zero) writes – and, accordingly, 0 write latency. But once every minute we see massive writes – SQL Server tries to write almost 4000 IOs per second for several seconds in a row (several meaning sometimes 2, sometimes 10). I guess this is the checkpoint. During these peaks, latency jumps to huge values – up to 1.8 seconds (!) per write.

The catch is that when we have 59 samples with 0 latency (we don’t have writes, remember?) and 1 second with 120ms latency, PerfMon will calculate the average as (59*0 + 1*120)/60 = 2ms per write. sys.dm_io_virtual_file_stats will report 120ms per write for the same period. And indeed, when I calculated the average write latency from PerfMon but took into account only the samples with a non-zero number of writes, the results came out very close to what sys.dm_io_virtual_file_stats showed.
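To make the arithmetic concrete, here is a minimal sketch (in Python, with made-up numbers matching the example above) of the two ways of averaging:

```python
# 59 one-second samples with zero writes, then one sample where the checkpoint
# issues 4000 writes averaging 120 ms each. (Illustrative numbers only.)
samples = [{"writes": 0, "avg_ms": 0.0}] * 59 + [{"writes": 4000, "avg_ms": 120.0}]

# PerfMon-style average over the period: every sample gets equal weight.
perfmon_avg = sum(s["avg_ms"] for s in samples) / len(samples)

# sys.dm_io_virtual_file_stats-style average: total stall / total writes.
total_stall = sum(s["writes"] * s["avg_ms"] for s in samples)
total_writes = sum(s["writes"] for s in samples)
file_stats_avg = total_stall / total_writes

print(perfmon_avg)     # 2.0 ms  -- the smoothed number
print(file_stats_avg)  # 120.0 ms -- the per-IO number
```

Same disk, same minute, same IOs – a 60x difference purely from the averaging method.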

So who is right in this story, and should we do something about the disk? Both are right – they just calculate different numbers. sys.dm_io_virtual_file_stats calculates the average latency per IO but gives no indication of whether the load is steady or consists of peaks and idle periods; peaks have much more influence, because peaks in latency are usually accompanied by peaks in the number of IO operations, so their relative weight is higher. PerfMon, on the other hand, when it comes to calculating an average over a period (the PAL tool, for example, does this automatically), smooths out the peaks, because for it every sample – the one with zero writes and the one with 10K writes per second – has the same relative weight.

Here is how our latency graph looks:

[Image: write latency graph]

My first thought was: “Checkpoint is a background process, it shouldn’t really affect user experience – the storage guy is right, leave the disk alone”. But then I decided to check read latencies – indeed, when the disk is so busy that a write takes 1.8 seconds, there is no reason to suppose that reads fare any better. As I said in the beginning, the overall average read latency was 9ms. But when separated into write-idle and checkpoint periods, the results were completely different: during write-idle periods, read latency was 1.5ms, but during checkpoints it jumped to 120ms. So user experience is affected – read queries that go to disk can run very slowly, or even appear stuck, during checkpoints. Beware. :-)

Published Wednesday, November 28, 2012 4:20 AM by Michael Zilberstein


Comments

Greg Linwood said:

Interesting article. The main reason for using io_virtual_file_stats vs perfmon is to try to obtain information at the database file level vs the whole disk or volume level.

Ideally we would be able to capture perfmon style calculations (transient rather than aggregated) but at database file level rather than at disk / volume level.

Even better would be to be able to capture this information at the index level, but that might be getting a little greedy for a paying customer.

November 28, 2012 3:42 AM

Michael Zilberstein said:

Greg, there is more than one perspective here. As a consultant, I need some "quick and dirty" diagnostic tools that can pinpoint problems without spending much time on capturing and analyzing traces. Procedures like sp_blitz are ideal here. Of course, if I find a bottleneck at the disk level, I'll dig into the file level and find which files to reallocate to other disks.

BTW, in my case the next step isn't reallocating files but checking the storage. If the storage is configured properly, write latency should be lower than read latency: reads go to disk, while writes should go to the storage cache. I guess here we have the storage configured as write-through, or with a low queue depth value.

November 28, 2012 10:00 AM

Greg Gonzalez said:

Hi Michael,

I wanted to point out that you can easily convert the sys.dm_io_virtual_file_stats data to the same numbers as PerfMon using this formula:

(x2 – x1) / (y2 – y1) / 1000

Where:

x1 = io_stall_ms start value

x2 = io_stall_ms end value

y1 = num_ios start value

y2 = num_ios end value

Perflib is effectively performing this calculation every time Perfmon queries it, based on the sample interval, but using latency ms and number of IOs at the disk level. If you were to sample virtual_file_stats at the exact same frequency and perform this calculation, you’d see very close to the same numbers. You’d obviously need to take disks and database file locations into account when doing the comparison.
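As a sketch (in Python, with hypothetical snapshot values), the formula above amounts to:

```python
def avg_sec_per_io(io_stall_ms_start, io_stall_ms_end, num_ios_start, num_ios_end):
    """(x2 - x1) / (y2 - y1) / 1000 -> average seconds per IO over the interval."""
    ios = num_ios_end - num_ios_start
    if ios == 0:
        return 0.0  # no IO in the interval
    return (io_stall_ms_end - io_stall_ms_start) / ios / 1000

# e.g. 4000 writes accumulating 480,000 ms of stall between two snapshots:
print(avg_sec_per_io(1_000_000, 1_480_000, 50_000, 54_000))  # 0.12 sec/IO
```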

I hope this is helpful!

November 28, 2012 5:19 PM

Michael Zilberstein said:

Hi Greg,

You're of course right, that's pure math. But 2 points:

1. Your counter isn't "Disk sec/write" but "Disk sec/write/sec".

2. Thinking straightforwardly, you wouldn't expect the "Disk sec/write" counter to be influenced by the sampling frequency. Yet in this case, sampling once every 2 seconds instead of every second would multiply the result by a factor of 2.

November 28, 2012 5:40 PM

Greg Gonzalez said:

#1: Not sure I follow exactly, but I'm converting avg ms per IO to avg seconds per IO to keep in sync with the Perflib counter which uses seconds, not ms.

#2: It's not influenced by sample frequency. If you take the delta of stalls ms and IOs over any period and divide, you'll get the avg ms per IO over that period. The sample frequency is irrelevant. Your assertion that PerfMon does a different calculation which weights samples disproportionately is not accurate. It calcs deltas between samples and divides, and the formula works the same way over 1 second or 60. As you said, it's pure math ;-)
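A quick sketch (in Python, with made-up cumulative totals) of why the delta-based calculation is frequency-independent: the deltas telescope, so recombining sub-intervals gives the same per-IO average as one long interval.

```python
# Cumulative running totals, one value per second (as Perflib and
# sys.dm_io_virtual_file_stats both store them). Illustrative numbers only.
stall = [0, 0, 120_000, 480_000, 480_000]  # cumulative stall, ms
ios   = [0, 0, 1_000,   4_000,   4_000]    # cumulative IO count

# Over the whole 4-second range:
whole = (stall[4] - stall[0]) / (ios[4] - ios[0])  # ms per IO

# Over two 2-second sub-ranges, recombined by summing the deltas
# (NOT by averaging the two per-interval ratios):
d_stall = (stall[2] - stall[0]) + (stall[4] - stall[2])
d_ios   = (ios[2] - ios[0]) + (ios[4] - ios[2])
recombined = d_stall / d_ios

print(whole, recombined)  # identical: 120.0 120.0
```

The discrepancy in the post appears only when the per-interval ratios themselves are averaged with equal weights.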

Here is a shot of the SQL Sentry dashboard showing total Windows disk and total SQL Server file latency, collected from Perflib and file_stats respectively:

http://www.sqlsentry.net/images/temp/windowssqllatency.png

Notice the patterns are almost identical. We convert the raw disk latency counter to ms, and the file_stats latency is calc'd using the formula above, both over a 20 second sample by default. This is a dedicated SQL system with multiple active disks and databases.

If you're going to compare the data from Windows and SQL, you must use the exact same frequency, and the samples must be taken simultaneously (or as close as you can get). If you don't, you can and will see timing related discrepancies. That may explain what you saw originally, not sure.

If you're interested we cover this issue and more fun stuff in the patent app for our disk analysis module:

http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220100214298%22.PGNR.&OS=DN/20100214298&RS=DN/20100214298

Cheers

November 28, 2012 7:44 PM

Michael Zilberstein said:

#1. You're right, just converted milliseconds to seconds.

#2. I see your point. But that's not the way you usually work with PerfMon. You don't take 2 snapshots (and as far as I understand, you'd have to write code in order to do that – there is no total cumulative counter for disk latency like in virtual_file_stats); rather, you record counter values every 10 or 15 seconds (every single second in my case) and then start to analyze. So when you come to analyze an interval, you have no choice but to calculate some sort of average over the recorded values. That's what the PAL tool does, for example. And that's the only way you can compare it to virtual_file_stats from SQL Server, because contrary to PerfMon, it contains only cumulative numbers.

Actually, there is a way to get the same results from PerfMon: the number of writes should be captured along with latency, and a _weighted_ average of the latency values should be calculated instead of a simple average – weighted by the number of write operations. Then, when the number of disk writes in a sample is zero, that sample's relative weight in the formula is also zero.
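For illustration, a sketch (in Python, with made-up per-second samples) of the weighted average described above:

```python
# (writes, avg_ms) per 1-second sample: 57 idle seconds, then a 3-second
# checkpoint burst with varying latency. Illustrative numbers only.
samples = [(0, 0.0)] * 57 + [(3000, 100.0), (4000, 150.0), (1000, 80.0)]

# Simple PerfMon-style average: every sample counts equally.
simple_avg = sum(ms for _, ms in samples) / len(samples)

# Weighted average: each sample's latency weighted by its write count,
# so zero-write samples drop out and busy samples dominate.
total_writes = sum(w for w, _ in samples)
weighted_avg = sum(w * ms for w, ms in samples) / total_writes

print(simple_avg)    # 5.5 ms  -- smoothed
print(weighted_avg)  # 122.5 ms -- matches the per-IO view of file_stats
```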

November 28, 2012 8:16 PM

Michael Zilberstein said:

The patent application is a really funny read – it seems like the same English language, but so different :-).

Greg, does your tool have a historical analysis module, or does it contain only real-time monitoring? If it does, and the user chooses some custom time period – how do you calculate latency over that period? Or do you just present the raw data (metrics) without calculating a single latency number?

November 28, 2012 8:21 PM

Greg Gonzalez said:

That's right, you can easily "back in" to the total latency per sample by multiplying the avg sec per IO times the number of IOs for the same interval. These samples can then be added up and divided by the total IOs for the range to get the same type of number you'd get from file_stats.

If you look under the covers at the raw Perflib data retrieved by Perfmon (or our app), the Avg Disk sec/Read|Write counter returns both the running total IOs and the running total latency in ms (BaseValue and RawValue properties respectively)... this is exactly the same way file_stats stores it, as running totals. PerfMon (or the app) then calls a function on the samples using the formula above to come up with the numbers you see.

I just wanted to ensure it was clear that there is no inherent problem with either data source... it all comes down to how you are processing and using the data.

Yes, it's amazing how the lawyers can take simple concepts and make them seem complex ;-)

Yes, SQL Sentry is all about historical data and analysis -- you can look at these and other metrics over any date range. Ranges <=30 minutes show raw samples (20 sec for latency), and >30 min ranges show averages over varying "resolution intervals" depending on the range size.

November 29, 2012 7:17 PM

Uri Dimant said:

Hi Greg

Just wanted to make sure regarding the formula

x1 = io_stall_ms start value

x2 = io_stall_ms end value

What do the start and end values mean? Is it io_stall_read_ms?

March 20, 2014 5:13 AM
