Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.
Thanks to everyone who took the time out of their conference experience to join Mike Wachal and me for yesterday's session on SQL Server performance tuning!
For those who weren't there, we focused in on troubleshooting techniques, highlighting some of the key DMVs and new Extended Events features that will help with proactive diagnosis of problems. My section, in particular, was a bit of a taste of some of the sessions I've submitted for this fall's PASS summit: I showed a brief demo of troubleshooting using my No More Guessing methodology, and applied it to the problem of workspace memory contention. The sessions for PASS cover these topics in detail, so if you find this area interesting please vote here.
The demo script I showed yesterday is attached, along with a copy of the latest versions of my Who is Active stored procedure and the SQLQueryStress tool. Feel free to leave a comment below or drop me a line via e-mail if you have any questions.
This post is part 30 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Pop quiz: What happens when you promise to write 30 posts on a topic, but you don’t plan properly and lose a bit of steam toward the end?
Answer: A recap post, a day or two early! Woo hoo!
The month started with a few introductory posts. A history of monitoring (from my point of view), some information on Who is Active’s design philosophy, and a few words on the Who is Active license.
The key takeaways from these first posts? The monitoring story has been pretty poor for most of SQL Server’s history. It’s better now, but it’s still a lot of work. Who is Active was created to make things more accessible. It’s free, with a few restrictions designed to keep people who make money from monitoring from making money on Who is Active. Anyone else can and should use and enjoy.
Next I got new users up to speed. My post on installation covered security considerations, followed by a post on why you may not see as much data as you're used to when you first start using Who is Active. I continued with a post on the various options supported by the procedure (plus a nod to the online help mode) and another post describing the default output columns.
One of the more important posts of the month described how Who is Active treats active requests and sleeping sessions. This can be a confusing area for new users, since it's not always obvious what's going on unless you're looking in the right place at the output (the [status] column).
Once the refresher was finished I began covering some of the basic options exposed by the procedure. Filters were one of the first things I implemented, so it was a logical place to start (the "not" filters came much later). Seeing the query text is a key part of the procedure's functionality, and that was next on my list. If you're not sure what the [sql_text] column means, this post will set the record straight.
The ability to see the query text is nice, but so is the ability to get a query plan--and of course Who is Active supports that too. And since everything a query does is transactional, the procedure allows users to collect information about what those transactions are up to.
The next section of the month was all about query processing. I started with a couple of background posts: One on how the query processor works (at a somewhat high level), and another on what blocking really means.
Who is Active exposes two different modes that help with analysis of real-time waits. My 15th post of the month covered the default--lightweight--collection mode. My followup post covered the more extensive full waits collection mode.
Sometimes a query plan and wait information isn't quite enough to diagnose a performance issue. What if the query plan usually works well, but isn't performing properly only in this specific case? Who is Active has a feature to help you figure that out. And what if you need a bit more information on some of the settings that the request is using?
Mining wait information yields some amazing returns. One of the additional pieces of information that you can get is the actual name of the object that's causing a block. Another thing you can see is the exact node within a query plan where a task is currently doing work. You can (and should!) also use waits to figure out whether tempdb is properly configured. There are still more tricks you can play with waits, but they'll have to wait for a future post.
Like wait information, lock data can also be mined. However, there is so much locks information that the real challenge is rendering it in a human-readable manner. Who is Active does that using a special custom XML format. The procedure also helps with another type of blocker analysis, in the form of finding the "lead blocker."
Once you've figured out what information you're interested in, why not set up the output so that you can see the information the way you want to see it? Who is Active helps with this by allowing users to dynamically customize output columns and sort order in a few different ways. And while you could always run Who is Active interactively, that might get a bit dull. Automated data collection is a much nicer methodology in many cases.
Most of the numbers shown by Who is Active are cumulative. But oftentimes it's more interesting to compare data between two snapshots. The procedure can help you do that, using its delta mode.
What fun would a month of Who is Active be without an official release? v11.00 has numerous new features, several of which were perfected this month thanks to feedback I received on the various posts in the series. It's great to have such enthusiastic users! Some of these new features didn't make it into prior posts, and other existing features are a bit hidden. So I did a writeup covering the more important things that you may not have noticed while working with Who is Active.
I finished the month with a discussion on security for slightly tougher situations. I hope that the module signing technique will allow Who is Active to be used in a number of places where security auditing requirements have made things difficult.
And that's that. A month of activity monitoring with Who is Active. Thank you for reading! Next, this text is going to be edited, expanded in places, and put into a much more comprehensive form. Watch your RSS reader for more information.
This post is part 29 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Early in the month I discussed basic security requirements for Who is Active. I mentioned the VIEW SERVER STATE permission and the fact that I consider it to be “a relatively low barrier to entry.”
But what if, in your organization, it’s not? Auditing requirements being what they are, you might be required to lock things down. And granting someone full and unrestricted VIEW SERVER STATE may simply not be an option.
Enter module signing. By securing Who is Active (or any other securable, for that matter) via inherited permissions, it’s often possible to get around auditing requirements, as long as the module itself has been reviewed. This is not at all a difficult thing to do, but in my experience most DBAs haven’t played much with signed modules. Today I’ll show you how quick and easy it can be to set things up.
Start by creating a certificate.
CREATE CERTIFICATE WhoIsActive_Permissions
ENCRYPTION BY PASSWORD = '1bigHUGEpwd4WhoIsActive!'
WITH SUBJECT = 'Who is Active',
EXPIRY_DATE = '9999-12-31'
Once you have a certificate in place, you can create a login from the certificate. The goal is to grant permissions, and to do that you need a principal with which to work; a certificate does not count. A login based on the certificate uses the certificate’s cryptographic thumbprint as its identifier. These logins are sometimes referred to as “loginless logins,” but I refer to them as “proxy logins” since that’s what they’re used for: proxies for the sake of granting permissions.
CREATE LOGIN WhoIsActive_Login
FROM CERTIFICATE WhoIsActive_Permissions
The login can be granted any permission that can be granted to a normal login. For example, VIEW SERVER STATE:
GRANT VIEW SERVER STATE
Once the permission has been granted, the certificate can be used to sign the module—in this case, Who is Active. When the procedure is executed, a check will be made to find associated signatures. The thumbprint of the certificates and/or keys used to sign the module will be checked for associated logins, and any permissions granted to the logins will be available within the scope of the module—meaning that the caller will temporarily gain access.
ADD SIGNATURE TO sp_WhoIsActive
BY CERTIFICATE WhoIsActive_Permissions
WITH PASSWORD = '1bigHUGEpwd4WhoIsActive!'
Getting to this step will be enough to allow anyone with EXECUTE permission on Who is Active to exercise most of its functionality. There are a couple of notes and caveats: First of all, every time you ALTER the procedure (such as when upgrading to a new version), the signature will be dropped and the procedure will have to be re-signed. You won’t have to create the certificate or the login again; you’ll just have to re-run that final statement. Second, you’ll only be able to use most of the functionality. Certain features, such as blocked object resolution mode, won’t operate properly, depending on whether the caller has access to the database in which the block is occurring. This may or may not be a problem—it depends on your environment and what users need to see—and Who is Active itself won’t throw an exception. An error message will be returned somewhere in the results, depending on what the user has tried to do.
If you would like to grant database-level permissions based on the certificate login so as to avoid these errors, that’s doable to. Just do something like:
CREATE USER WhoIsActive_User
FOR LOGIN WhoIsActive_Login
This will allow Who is Active to figure out what the various blocked or locked object names are. Since the login is just a proxy no one can actually log in and get direct access to read the data, so this isn’t something I consider to be a security risk. However, keep in mind that if anyone has the password for the certificate and sufficient privileges in master, a new module could be created and signed. Keep the password secure, and make sure to carefully audit to catch any infractions before they become a risk.
Security policy should never be a reason to limit your monitoring choices. Module signing is a powerful tool for Who is Active in addition to many other applications. I highly recommend studying it in detail in order to enhance your ability to provide high-quality, totally flexible, and completely secure solutions.
This post is part 28 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Over the past month this series has covered a lot of ground. After writing almost 30 blog posts on the stored procedure it’s interesting to look back and see the fact that it’s gotten much bigger and more complex than I realized. Kind of like watching a plant grow, you don’t notice the day-to-day changes until one day you go to water it and see that it’s taken over your entire garden.
A few things that I’ve added along the way have been especially useful, but there hasn’t been a good place to mention them thus far in the series. With just three posts left for the month, now is the time.
Always Show Blockers
In a post earlier this month I talked about filters. Filters let you decide exactly what you want to see, and what you not don’t want to see (“not” filters). But sometimes you have no choice: if you’re filtering so that you only see session 123, and it’s being blocked by session 456, you’ll also see information about session 456. The idea is that you should always get enough context to fully evaluate the problem at hand. Even if it means that you see more information than you were intending to see the first time around. You probably would have asked for 456 next anyway.
In @get_task_info = 2 mode, you may see waits called “RUNNABLE.” This could strike you as an oddity, given that there is no such wait type in SQL Server. I wanted to show tasks on the runnable queue, and making up a fake wait type seemed like a reasonable way of accomplishing the task. In practice, it has worked extremely well—I’ve used this feature countless times to help understand scheduler contention on a SQL Server instance.
Accurate CPU Time Deltas
CPU time is a tricky metric. It gets handled by Who is Active’s delta mode, and has for several versions. But historically, never very well. The data simply isn’t represented in an easily-obtainable fashion in the core DMVs. Recently I decided to dig deeper into this problem and discovered that I could get better numbers from some of the thread-specific DMVs. They’re cumulative numbers, based on the lifetime of the entire thread—not too good for the usual Who is Active output. But for snapshot and delta purposes, just about perfect. Meaning that in v11.00 of Who is Active, you can see the [CPU] column show a value of 0, while the [CPU_delta] column shows a value in the thousands. It’s not a bug. It’s a feature. (It really is!)
Text Query Plans and the XML Demon
SQL Server 2005 introduced query plans as XML. Management Studio knew how to render these plans graphically. And we were able to pull the plans from DMVs. Life was great. Except, perhaps, when you actually wanted to view one of these plans, and you had to save the thing out to a .SQLPLAN file, close the file, then re-open it. That’s about three steps too many for my taste, so I was overjoyed when the Management Studio team decided to wire things up the right way in SQL Server 2008. Click on a showplan XML document, see a graphical plan. Simple as that.
Unfortunately, the XML data type has its own issues, including one particularly nasty arbitrary limitation that has to do with nesting depth. The idea is to make sure that XML indexes don’t crash and burn too often (not a big concern for me, given that I’ve never seen one used in a production environment—but I digress). The problem is that query plans are heavily nested. And to get that nice graphical plan workflow, SSMS needs the plan rendered as XML.
In prior versions of Who is Active I gave up and returned either an error or a NULL. But in v11.00 I decided to make things a bit better. If the nesting issue occurs, Who is Active will now return the plan as XML encapsulated in some other XML in a text format, along with instructions on how to view the plan. This won’t give you a nice one-click experience, but it will give you the ability to use Who is Active to see some of the bigger plans that are causing performance issues.
Service Broker Needs Love Too
One of the most interesting features of Service Broker is activation. But a vexing design choice on behalf of the Service Broker team was to make activation procedures launch as system sessions. This means, among other things, that prior versions of Who is Active filtered them right out of the default view. To see them you’d have to enable @show_system_spids. And then you’d have to ignore all of the other system stuff. And you’d get woefully bad time information (no, the activation process hasn’t been running for 25 days; that’s the last time you restarted the SQL Server instance). In Who is Active v11.00 this has been fixed. Service Broker activation processes are now displayed by default along with other user activity. And I found a way to fix the timing issue, thanks to some advice on Twitter from Remus Rusanu, one of the guys who originally worked on Service Broker. So if you’re using activation and monitoring with Who is Active, life is good.
The homework section of this series is officially closed. Your assignment is to go enjoy the rest of your day.
This post is part 27 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Thanks to your feedback over the past month, I’ve managed to get a lot of work done on the next version of Who is Active.
So much work, in fact, that I’ve finished a new major release.
Click here to download Who is Active v11.00
There are numerous enhancements in this release. In no particular order:
- CPU deltas use-time thread-based metrics for more accurate data (use both @delta_interval and @get_task_info = 2)
- command_type information added to [additional_info] column for active requests
- Modified elapsed time logic to retrieve more accurate timing information for active system SPIDs
- SQL Agent job info (job name and step name) is now included in the additional_info column (use @get_additional_info = 1)
- If there is a lock wait, information about the blocked object (name, schema name, and ID) is now included in the additional_info column (use both @get_additional_info = 1 and @get_task_info = 2)
- Service Broker activated tasks are now shown by default, without using @show_system_spids mode. The program_name column contains the queue_id and database_id associated with the activation procedure
- Various numeric columns, including reads, writes, cpu, etc, have been made nullable. These will occasionally return NULL, on extremely active systems where the DMVs return data more slowly than queries start and complete
- Query plans that cannot be rendered due to XML type limitations are now returned in an encapsulated format, with instructions, rather than sending back an error
- Added wait information for OLEDB/linked server waits
- Wait collection will now "downgrade" to get_task_info = 1 style data if no other information is available in get_task_info = 2 mode
- Added header information to online help
- Added a login_time column to the output
- The duration for sleeping sessions is now the sleep time, rather than the time since login
- Fixed various bugs
I’ve written about several of these things over the past month, and there are a couple of posts left for the remainder of the month, so I won't elaborate here.
Thank you, thank you, and thank you again to everyone who has taken the time to give me feedback and/or report bugs! Who is Active would not be what it is without you!
I would especially like to thank Paul White. Paul has acted as my unofficial "lead QA engineer" for the past few versions. He's found some very interesting bugs in my code, in addition to pointing out some quirks in the DMVs that I wasn’t handling properly. Thanks, Paul, for the great job you’ve done!
One other thing I would like to point out is an addition to the header: a donation link. Several people recently have asked me how to donate, and I didn’t have a good answer. Now, I do.
Use this link to support the Who is Active project!
While I feel strange asking for money, the truth is that I’ve invested well over a thousand hours in the stored procedure, documentation, and support (I respond to EVERY e-mail I receive from Who is Active users). If you’ve been able to improve your day-to-day performance thanks to Who is Active, it would be great if you—or even better, your employer—could help support my work on the project. I’ve listed a recommended donation of $40 per installed instance, but please donate however much you feel Who is Active is worth to you and/or your organization.
Thanks again, and enjoy!
This post is part 26 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
The CPU is pegged. Your hard disks are fried. Memory consumption is through the roof. Who is to blame?!?
Well that’s easy. Blame the session with the most CPU consumption. Blame the session doing the most I/O operations. Ask Who is Active what’s going on—it’ll tell you...
Clearly session 52 is the problem here. It’s consumed over 2,000,000 milliseconds of CPU time. Session 53, on the other hand, has consumed a paltry five seconds. No brainer, right?
Except that it’s not.
Session 52? It’s in a wait state. (WAITFOR, to be exact.) Not consuming any CPU time at all at the time I ran Who is Active. Session 53? Using plenty. So why the discrepancy? Does Who is Active have a major bug?
The fact is—and this will be obvious to a lot of you, so forgive me—most of the metrics reported by Who is Active are cumulative. They’re totals from the entire session, or the entire lifetime of the request (depending on what’s going on). In the case of session 52 in the image above, that much CPU time was consumed over a five-minute period prior to the point in time when I ran Who is Active. Session 53, on the other hand, is currently consuming CPU time. But how do I figure that out?
The answer: delta mode. This feature is something for which you can thank Jimmy May, who kept telling me about his “SQL Deltoids” script that he’d written back in the bad old days of SQL Server 2000. “If only there were a way to apply that script to the SQL Server 2005 DMVs,” he lamented one day. So I added the functionality into Who is Active. It’s quite simple, and surprisingly effective.
To use delta mode, simply invoke Who is Active’s @delta_interval option. Setting an interval will cause the procedure to execute its main logic branch, then wait the specified amount of time—a number of seconds between 1 and 255—before running the same main logic branch again. All of the numeric values that changed during that period will be compared, and the changed numbers will be output in a series of columns named with _delta at the end.
For example, here are the same two sessions as above, viewed in delta mode:
@delta_interval = 5
During the five-second delta interval, session 52 consumed no CPU time. Session 53, on the other hand, consumed over 5,000 milliseconds of time. If the CPU is pegged, 52 is no longer contributing—53 is the session of interest.
Enabling delta mode will add some or all of the following columns to the output, depending on what other options have been set:
By leveraging these delta columns in conjunction with the @sort_order option, it’s easy to quickly see which sessions are currently consuming your valuable server resources—which is generally more interesting than seeing which sessions may have been consuming resources but are now waiting or sleeping.
Note: The various sessions and requests you'll see in delta mode pertain to the information captured after the wait interval. Information about requests that were running as of the first collection but are not running as of the second is discarded. New requests that started after the first collection will be shown, but with delta values of NULL. Locks, transactions, and other optional information, will also be captured only for the second collection, since no delta calculation is possible for those metrics (at least, not yet).
Delta mode, in its current form, is only really useful if you have requests that last longer than a second (at minimum—I usually do five or ten-second intervals). What kind of workload do you see on your servers? Is delta mode as it exists today something that you’re making use of (or that you’ll start making use of, now that I’ve written a post about it)?
This post is part 25 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Yesterday’s post was about configuring the output based on what you want to see. Today’s post is about taking that perfect output and persisting it.
There are many reasons that you might like to store the results of a Who is Active call to a table. Some of the real use cases that I’ve been told about by Who is Active users include:
- Scheduled monitoring. Capturing the results of Who is Active calls in 5 or 10 minute intervals to see what’s happening on the database server throughout the day (or night)
- Using it as part of a build process, to verify that the correct things are happening in the correct order at the correct times
- Using it as part of an exception handling system that automatically calls Who is Active every time an error occurs, to snapshot the current state of the database instance
And there are various other use cases in addition to these. The point is that there are a number of reasons that you might want to capture the output.
Unfortunately, it’s not as simple a task as you might think. The first time I tried to make it work, I did something like:
CREATE TABLE #output
This approach failed miserably. If you try it, as I did, you’ll see the following error message:
Msg 8164, Level 16, State 1, Procedure sp_WhoIsActive, Line 3086
An INSERT EXEC statement cannot be nested.
Who is Active uses a number of INSERT EXEC statements, and they cannot be easily changed or removed, so for a while it seemed like all was lost. After a bit of brainstorming, however, I realized that I could simply build yet another INSERT EXEC into Who is Active—one that will insert into a table of your choice.
Of course, first you need a table. And if you’ve been reading this series you’re no doubt aware that the output shape returned by Who is Active is extremely dynamic in nature, and depends on which parameters are being used. So the first option I added was a method by which you can get the output schema. Two parameters are involved: If @return_schema is set to 1, the schema shape will be returned in an OUTPUT parameter called @schema. This is best shown by way of example:
DECLARE @s VARCHAR(MAX)
@output_column_list = '[temp%]',
@return_schema = 1,
@schema = @s OUTPUT
The idea is that you set up your Who is Active call with all of the options you’d like, then bolt on the @return_schema and @schema parameters. Here the column list is being restricted to only those columns having to do with tempdb. If you run this code, the SELECT will return the following result:
CREATE TABLE <table_name> ( [tempdb_allocations] varchar(30) NULL,[tempdb_current] varchar(30) NULL)
This result can be modified by replacing the “<table_name>” placeholder with the name of the table you actually want to persist the results to. Of course this can be done either manually or automatically—after the call to Who is Active, the text is sitting in a variable, so a simple call to REPLACE is all that’s needed. That call could even be followed up by a call to execute the result and create the table...
DECLARE @s VARCHAR(MAX)
@output_column_list = '[temp%]',
@return_schema = 1,
@schema = @s OUTPUT
SET @s = REPLACE(@s, '<table_name>', 'tempdb.dbo.monitoring_output')
Of course now you probably want to put something into the table. Crazy! To do this, drop the @return_schema and @schema parameters and replace them with @destination_table—the name of the table into which the results should be inserted:
@output_column_list = '[temp%]',
@destination_table = 'tempdb.dbo.monitoring_output'
Now the results of the call will be inserted into the destination table. Just remember that every time you change the Who is Active options, you’ll have to re-acquire the output shape. Even a small change, such as adding an additional column to the output list, will result in a catastrophic error.
@output_column_list = '[session_id][temp%]',
@destination_table = 'tempdb.dbo.monitoring_output'
Msg 213, Level 16, State 1, Line 1
Column name or number of supplied values does not match table definition.
How far you take this feature depends on how creative you are. Some of you have come up with elaborate schemes, but I generally keep it simple. Something that I like to do is to set up a short semi-automated process by using Management Studio’s GO [N] option. I use this when I’m doing intense debugging, and will do something like:
DECLARE @s VARCHAR(MAX)
@format_output = 0,
@return_schema = 1,
@schema = @s OUTPUT
SET @s = REPLACE(@s, '<table_name>', 'tempdb.dbo.quick_debug')
@format_output = 0,
@destination_table = 'tempdb.dbo.quick_debug'
WAITFOR DELAY '00:00:05'
This will first create a table in tempdb, after which it will collect the results every five seconds for a five-minute period. I set @format_output to 0 in order to get rid of the text formatting so that I can more easily work with the numeric data. The results can be correlated to performance counters or other external information using the [collection_time] column, which was added to Who is Active specifically to support automated data collection.
Share your experiences with Who is Active’s data collection feature. How are you using it? Have you hit any problems or roadblocks? Any awesome success stories? A few sentences will be great, and more is fine if you’re in the mood to tell a story.
This post is part 24 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Early in the process of creating Who is Active it became clear that there was no way everyone was going to agree.
With each release I received requests to move some column so that it would show up on the lefthand side, or to change the sort order, or to make some other modification that someone felt was necessary to help them more easily digest the data. The problem: it was impossible to accommodate all of these requests. So I decided to go with a self-service model.
In order to allow users to dynamically customize the output (without touching any code), the stored procedure exposes three options: @output_column_list, @sort_order, and @format_output. Each of these is discussed in the following sections.
@output_column_list controls not only whether or not certain columns will be shown in the output, but also the order in which the columns are displayed. The correct argument is a list of bracket-delimited column names (or partial names with wildcards). Delimiters are not necessary (use whatever delimiter you like, or none at all; they’re ignored). The key to successfully using this option is to remember that inclusion of columns in the output is additive: many columns (such as [additional_info]) are only added to the output if both the correct options are enabled for the stored procedure and the columns are included in the column list. If you start modifying the list and don’t take this into account, you may not see the columns you’re expecting when you go back later and start changing options. To keep things flexible, make sure to use wildcard columns, especially a generic wildcard ([%]) at the end.
Using the column list can be as simple as specifying the exact columns you’re interested in:
@output_column_list = '[tempdb_allocations][tempdb_current]'
Notice that no delimiter is used here. Again, any extraneous text aside from the column specifiers is ignored, so the following call is equivalent:
@output_column_list = 'this[tempdb_allocations]is[tempdb_current]ignored'
Easier than specifying exact column names is to use wildcards that match the pattern of the columns you’re interested in:
@output_column_list = '[tempdb%]'
All three of these calls will yield the same output, similar to the following:
Of course, this will return only these two columns. Generally I’ll use the column list feature just to move things around so that I don’t have to do as much scrolling, and in those cases I almost always want everything else, too. That’s where the generic wildcard comes into play:
@output_column_list = '[tempdb%][%]'
Now the tempdb-specific columns appear on the lefthand side, and all of the other columns follow, in a default order.
The next option we’ll cover is @sort_order. This option controls the order of the rows output by the stored procedure. Like @output_column_list, the input is a list of bracket-delimited column names. In this case, wildcards are not supported, but the keywords ASC[ENDING] and DESC[ENDING] are supported after the column name.
The following call returns data sorted by [login_name] ascending (ascending is default, so the keyword is optional), with ties broken by [session_id] descending:
@sort_order = '[login_name][session_id] DESC'
As before, delimiters are optional and are ignored. Please note that the current betas of Who is Active have bug where multi-column sorts like this one are not properly honored. That issue will be fixed in the next version of Who is Active.
Sorting is especially useful when doing any kind of comparison of the various requests currently running. I use it extensively in block leader mode ([blocked_session_count] DESC), and it is quite handy in delta mode—which will be covered in a post in a few days.
The final option this post will cover is @format_output. This one is based on the fact that Who is Active is designed to leverage SSMS as a “graphical user interface” of sorts. Query text is sent back in an XML format, numbers are formatted as right-justified strings, and elapsed time is formatted in the more easily-digestible [dd hh:mm:ss.mss] form.
If you’ve been using Who is Active for a while you’ve probably noticed the right-justified numbers:
This format was suggested fairly early on by Aaron Bertrand, and is one of my favorite things about the stored procedure. It makes it much easier to pick out bigger numbers when you’re looking at a large set of data. By default, SSMS uses a non-fixed width font for grid results, so the default argument to @format_output, 1, takes this into consideration. But some people—like Aaron Bertrand—change the SSMS settings and use a fixed width font instead. If you’re one of these people you can use an argument of 2 to get nicely-formatted numbers. If you don’t change the argument you might notice that the numbers don’t seem to properly line up when you’re working with a set of data containing numbers of greatly differing size.
Still other people don’t like the right-justified numbers or are doing collection to a table (see tomorrow's post), and so formatting can be completely disabled by using an argument of 0. It’s up to you...
Today’s homework is to enjoy your Sunday. You’re a busy DBA; you’ve earned a break. Get outside and off of the computer!
This post is part 23 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Oftentimes blocking situations are a bit more complex than one session blocking some other request. In busy systems blocking chains can build up that include dozens or even hundreds of requests, all waiting on one another. And all are waiting as a result of one top-level blocker: the block leader. In many cases fixing the blocking problem means fixing whatever it is that the block leader is doing, so identifying it quickly is a key part of debugging blocking situations.
Finding the block leader is a fairly simple process once you realize that blocking is effectively a hierarchy. The process involves starting with all of the requests that aren’t blocked and walking down the blocking chain until the leaf level is reached—blockees that lack any more downstream blockees. At each level, a number is recorded to figure out the blocking depth. As an added benefit, a second pass can be made to reverse the number at the end of the process—this shows the total number of downstream blockees for each blocker.
While this is relatively easy to implement using a Common Table Expression, it’s certainly not something that users should have to reinvent each time it’s needed. So Who is Active exposes an option, @find_block_leaders, that does the work for you. This option adds a new column to the output, called [blocked_session_count], which reflects the total blockee count. Higher numbers mean more sessions blocked downstream; the sessions with the highest numbers are your block leaders, and these are the ones you want to focus on.
To see this in a bit more detail, run the following batch in four sessions:
UPDATE TOP(10) Sales.SalesOrderDetail
SET OrderQty += 7;
Assuming that nothing else is running, the first session will complete. The other three will block. Who is Active reports this, of course, even in its default mode:
The initial update was run on session 53, which is blocking session 54. Both 55 and 56 are reported as blocked by 54, although in reality they’re being blocked indirectly by 53. This case, while more complex than most of the blocking examples used in this series, is still simpler than many of the things seen on average production systems. None the less, it’s enough to show the power of the Who is Active option that this post is about...
@find_block_leaders = 1
The [blocked_session_count] clearly shows which session is causing the issue in this case: 53 has three downstream blockees, whereas 54 has only two, and the other two sessions have none.
In this case, because I happened to run the batches in the correct order, the data is returned by default with the block leader sorted on top. That may or may not always be the case in a real system, but it’s quite a useful thing when there are numerous active requests and you want the most important ones right at the top. To accomplish that, use the output ordering feature—which will be covered in detail in tomorrow’s post.
@find_block_leaders = 1,
@sort_order = '[blocked_session_count] DESC'
Today’s “homework” is a question for you to consider about the behavior of Who is Active with regard to sleeping sessions: Today the [start_time] and [dd hh:mm:ss.mss] columns, for sleeping sessions, correspond to the login time for the session. But recently I’ve been thinking that it might make more sense to instead show the amount of time since the last request completed—the amount of time that the session has been sleeping. This seems to me to be more useful information and more in line with the goal of the stored procedure.
Which would you rather see, and why? I would appreciate any input and will carefully consider it. This would be a fairly major change, and it is an important decision either way.
This post is part 22 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Note: Before reading this post, please download the most recent Who is Active build, which includes a key fix for the locks mode.
Blocking has been a theme of a couple of recent posts in this series. And that’s not even the end of it. Helping you find and properly evaluate blocking issues is a core part of Who is Active’s raison d'être.
Behind every block is something causing the block. Something that, without which, there could be no block. And that thing is called a lock.
Locks are synchronization objects. Their mission in life is not to give you headaches when dealing with blocking issues, but rather to help maintain the ACID properties that are a big part of the reason that DBMS technology is so popular. For locks in particular this means the “I” property: Isolation. (And, to a lesser extent, the "C" property: Consistency.) Locks keep readers from reading data that writers haven’t finished writing, and they keep writers from overwriting data as it’s being read or written by someone else. This is a good thing. Failure to take these kinds of precautions would result in chaos. And a lot of really bad data.
When a DBA sees blocking, her first instinct is to eliminate it. Kill the blocker! Add a NOLOCK hint! Change the processes around! But blocking is not necessarily a bad thing. Blocking means that your data is being protected. Readers are getting consistent results. Writers aren’t overwriting each other. Everything is as it should be—even if your queries are waiting an inordinate amount of time for data.
When you see blocking, the correct move is not to eliminate it, but rather to evaluate it. Figure out what’s causing the blocking. Figure out why (or whether) it’s necessary, and what the alternatives might be. Then—and only then—should you start killing sessions, adding hints, or taking similar action.
Evaluating blocking can be a painful experience. The sys.dm_tran_locks view (formerly syslockinfo) contains a large number of columns. Many of these are numeric values that need to be referenced elsewhere in order to be meaningful to the average human. And even then, it’s simply not a very nice user experience...
Glancing at this list of lock information, it’s impossible to tell what’s going on. (The query I ran to pull up this list returned 2700 rows.)
Who is Active solves this problem by putting locks into a somewhat more human-readable form: a custom XML format. The stored procedure does all of the work of going to the databases with locks and decoding the numbers. So instead of seeing something like 72057594038845440, you’ll see something like Sales.SalesOrderHeader. Whether or not you think that XML in general is a very readable format, the fact that the various object names have been resolved for you makes it a lot better than a straight query against sys.dm_tran_locks.
To get lock information, use Who is Active’s @get_locks = 1 option. This will add a column called [locks] to the output. The column is typed as XML, and you can click on it to see the full contents. The document will have one root node per database in which there are locks. For the table listed above, the collapsed nodes look like this:
Under each database node is one node that represents locks on the database itself, and a node called Objects that contains subnodes for each object in the database that’s locked. These are grouped by object name and schema name:
Any given object can have multiple types of active locks issued against it at one time, so inside of the Object nodes are one or more Lock nodes:
This format allows for quick and simple exploration of the various locks that are active on behalf of your session. Each Lock node has an attribute called request_status. If its value is “GRANT,” the lock is held by the session. If its value is “WAIT,” the request is waiting to acquire the lock.
A full description of the various lock types is well beyond the scope of this post, but most of them are documented in the Books Online entry for sys.dm_tran_locks.
A cautionary note: Using the @get_locks option can seriously slow down Who is Active. The sys.dm_tran_locks DMV is known to be one of the slowest DMVs, and in some cases it can hold a huge number of rows. I have seen numerous cases where a simple SELECT * against the DMV took 20 or more minutes to finish. When dealing with locks, which can change rapidly, that’s far too much elapsed time for the results to be meaningful. Recent Who is Active builds include blocked object resolution mode, which is designed to be a much lighter weight alternative to using the full locks mode.
Using the various DMVs it’s possible to write a number of queries that can deeply analyze block situations. Although Who is Active already does this, it’s an interesting exercise. Can you write a query that shows all blocked requests, the lock mode for each request, the blocker session or request that the blockees are waiting on, and the blocker lock mode?
This post is part 21 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Tempdb. Everyone’s favorite shared bottleneck.
The funny thing about tempdb is that it’s not used by every query. It’s only really used by the biggest queries. The queries where performance really matters. And of course, that makes the situation all the worse. When tempdb is a problem, it’s a major problem.
A common cause of tempdb issues is latch contention. This occurs due to three “special” page types: GAM (Global Allocation Map), SGAM (Shared Global Allocation Map), and PFS (Page Free Space). Each of these pages (sometimes many of each) exist in all database files, and they are responsible for helping to identify where incoming data can be written in the physical file. For complete background, read this post by Paul Randal.
Whenever a process needs to update one of these special pages, a latch is taken. A latch can be thought of as a kind of very lightweight lock; it’s designed for synchronization and is intended to be quickly turned on, and then just as quickly turned off after it’s no longer needed. The problem occurs when lots of processes happen to hit the SQL Server instance at the exact same time, and they all need to find a place to store some data. Suddenly, there are numerous outstanding latch requests. They start queuing up, and before you know it the last one in line has been waiting for seconds—and not doing any work at all in the meantime.
Seeing this in action is quite simple. Create a bunch of temporary tables, in a very short period of time. I like to demo this using SQLQueryStress. Ideally you should do this kind of test on a server with only a single tempdb data file, to really highlight the issue. Here’s the code to run:
20 or so concurrent threads should do it. Hit the start button in your load tool, wait a few seconds, run Who is Active, and you’ll see something similar to:
See those PAGELATCH waits? They’re all on the same resource: PFS pages, in tempdb file ID 1. The format for PAGELATCH and PAGEIOLATCH waits is: [wait_type]:[database_name]:[file_id](page_type). Who is Active can decode the page types for GAM, SGAM, PFS, DCM (Differential Change Map), and BCM (Bulk Change Map) pages. For any other page, the page type will be an asterisk (*).
These waits are all on update (UP) latches, but it's also quite common to see exclusive (EX) latches when this problem occurs
Fixing this problem is amazingly simple: just create more tempdb files! When you create additional tempdb files—as long as they’re equally sized—SQL Server will automatically balance incoming requests across the multiple files. Since each file maintains its own PFS, GAM, and SGAM information, the contention will be eliminated.
How many files should you create? Well, that depends on your workload. I take a simple approach: start with a number of files equal to one quarter to one half of the number of logical schedulers. If the contention goes away (i.e. Who is Active is no longer showing PAGELATCH waits on these special pages in tempdb), stop there. Otherwise, keep increasing the number of files until the contention does go away.
Again: make sure to keep the files equally sized! SQL Server’s algorithm is based on a proportional fill model, which means that if one file is bigger than the others it will be chosen more often. This will cause it to grow more quickly, which will cause it to be chosen more often, which will cause it to grow more quickly, which will keep compounding until your disk is full and your performance is back in the red zone. Not a good situation to get yourself into. Grow the files bigger than you need them at create time, and monitor to make sure things don’t get out of hand.
Use Who is Active to check your production servers for tempdb contention! Tell me in the comments below whether you found any. It’s amazing how common this issue is, yet how simple the fix turns out to be. A very satisfying task for even the most harried of DBAs.
This post is part 20 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Query plans are packed with information about what’s going to happen, or what has happened. But they’re markedly quiet about what is happening.
This is probably best illustrated with an example. Consider: you’re sitting there at your desk one morning, enjoying a nice hot cup of coffee, when the phone rings. It’s Steve, the harried middle manager who sits on the 2nd floor. “Help!” Steve shouts into the phone, making you flinch and immediately move the headset away from your ear. “My report has been running for 10 minutes! I have a meeting in three minutes! Is it going to finish?”
You calmly open Management Studio, fire up Who is Active, and glance at the various columns. No blocking. No extreme waits. Nothing too interesting in the plan. Just a big query chugging along.
At this point there’s really not much you can do. But there is one hint that might help give you some more insight. Who is Active collects, along with CXPACKET waits, a node identifier. These are displayed along with the wait when @get_task_info = 2 is used. This can help you figure out approximately what your plan is up to right now.
Too see this in action, first create a blocking situation so that we have something to look at:
UPDATE TOP(10) Sales.SalesOrderDetail
OrderQty += 7
Next, in a new window, fire up the following query:
PARTITION BY sd.SalesOrderDetailId
ORDER BY sd.ProductId
) AS r
) AS sd
INNER JOIN Sales.SalesOrderHeader AS sh ON
sh.SalesOrderId = sd.SalesOrderId
) AS s
s.r = 1
Assuming that you have a multicore machine, the plan for this query will have a few different parallel sections. Which is what we need for CXPACKET waits. Let’s see what Who is Active has to say about the situation:
@get_task_info = 2,
@get_plans = 1
The query has 10 tasks. One of them waiting on a lock; we’ll ignore that one since we’ve created the lock to “pause” things in this case. The other nine are all waiting on CXPACKET waits—but there are two different groups. The first group consists of eight tasks, all waiting on node 17. The second group has only a single task, and it’s waiting on node 0.
So what are these nodes? Bring up the plan, hover over any of the iterators therein, and you’ll see:
Every iterator in the plan has an associated node identifier. These range from 0—for the leftmost node in the plan—on up, as you move to the right. CXPACKET waits are associated with exchange (parallelism) iterators, and the wait information includes the actual node identifier from the plan. By using this information you can begin to understand the flow of data through your larger plans, as task waits move from node to node.
This is certainly not a perfect solution, and chances are very good that you won’t be able to give Steve from the 2nd floor the exact amount of time remaining. But every bit of information helps, and in this case you may be able to come up with a reasonable estimate. Even if he ends up running to the meeting with only half of the report printed.
Yesterday’s task was to cause an error to occur in Who is Active’s blocked object resolution mode. Easily enough accomplished by creating a permission issue:
CREATE LOGIN active_error
WITH PASSWORD = 'abcd1234!!!!'
GRANT VIEW SERVER STATE
EXECUTE AS LOGIN='active_error'
@get_task_info = 2,
@get_additional_info = 1
Running this will cause a new node to appear inside of <block_info> in the [additional_info] column:
<query_error>The server principal "active_error" is not able to access the database "AdventureWorks" under the current security context.</query_error>
Today we'll take a break from the homework to celebrate the middle of the work week and two-thirds of this thirty-part series behind us.
This post is part 19 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Debugging a blocking situation can be a complex process. First you need to figure out who’s doing the blocking, and who’s getting blocked. Next—before you can actually debug things—you need to figure out what the lock is that’s actually causing the blocking to occur.
This second phase has been known to cause many a DBA to rip out numerous hairs. First comes a visit to the locks DMV, followed by an extended period of research. What is a [resource_associated_entity_id]? Does that have any bearing on the [request_owner_lockspace_id]? And do you even care? If you’re anything like me, most of the time you just want to move on with life.
Who is Active solves this problem in two distinct ways, the first of which I’ll cover in today’s post, and the second a few posts down the road. Which method you choose depends on how much context you need to solve the problem at hand, and how much pressure you’re willing to put on your SQL Server instance. Today’s method is much lighter weight in terms of resource consumption and, in many cases, elapsed time.
Creating a blocking situation is easy enough... In one window do:
UPDATE TOP(10) Sales.SalesOrderDetail
OrderQty += 7
And in a second:
Who is Active figures things out easily enough in its default mode:
...but what if things weren’t quite so cut-and-dry and you needed a bit more information? In this case, you could enable two options that have been covered in recent posts: @get_task_info = 2 and @get_additional_info = 1. Who is Active uses these two options together to populate the [additional_info] column with information about what object is actually causing the blocking situation to occur:
@get_task_info = 2,
@get_additional_info = 1
Clicking on the [additional_info] column for the blocked session reveals...
...full information about the blocked object. This particular lock is a page lock on the Sales.SalesOrderDetail table (go figure). Information can be resolved for virtually all types of locks, and generally speaking this mode does the work extremely quickly, so it should not add a lot of overhead to your monitoring sessions.
Like other features in Who is Active that need to visit various user databases, this mode uses an exception handler in case things don’t go as planned. This will display an error message in a child node of <block_info>. Can you figure out how to make an error message appear using the shortest possible script?
This post is part 18 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
Sometimes you just need more.
With over 20 columns in the default output plus several more than can be dynamically enabled and disabled, Who is Active was already overwhelming enough for certain users. But requests kept pouring in for various additional information—metrics to help debug trickier situations and edge cases.
Rather than cluttering the output, I decided to create a single, special-purpose column for everything that’s not quite important enough to be on its own in the output. The [additional_info] column is an XML column that returns a document with a root node called <additional_info>. What’s inside of the node depends on a number of things, but by default you can expect to see:
Rather than repeat the documentation, I’ll point you to the BOL entry for sys.dm_exec_requests for information about what all of these mean. Most of them are various settings that can be manipulated by a given user, batch, or stored procedure. They impact the results of a query and, in some cases, its plan. So it’s a good idea to be able to pull them up when needed.
Beyond these, the [additional_info] column might also contain various other pieces of information, depending on which options are selected and what happens to be running. For example, if a SQL Agent job is running [additional_info] will be populated with:
- job_id: the identifier for the job in MSDB
- job_name: the name of the job, from MSDB
- step_id: the identifier for the job step in MSDB
- step_name: the name of the job step, from MSDB
- msdb_query_error: included when an error occurs that renders Who is Active unable to resolve the job and step names
Today’s post is just a quick overview; I’ll cover other things you can expect to see in [additional_info] in a later post. In the meantime, how do you get all of this information? Simple:
@get_additional_info = 1
What information would you like to see added to the [additional_info] column? Let me know in the comments below. I’ll carefully consider every suggestion but there are no guarantees; remember, too much data is just as bad as not enough (and maybe worse). I plan to be just as careful with [additional_info] as I am with the rest of the columns output by the stored procedure.
This post is part 17 of a 30-part series about the Who is Active stored procedure. A new post will run each day during the month of April, 2011. After April all of these posts will be edited and combined into a single document to become the basis of the Who is Active documentation.
There you are, minding your own business (and that of everyone else on your server) checking out activity with Who is Active.
Some stored procedure is running for what seems like an excessive amount of time—but you’re not quite sure. Should you take a closer look?
The answer can only be my favorite two-word phrase: It Depends.
If a system has been live for quite some time and users are happy enough with performance, I tend to not get too caught up in proactively trying to find things to tune. If, on the other hand, users are complaining, or something has just started taking a lot longer than it used to, then it’s time to step in and do some tuning.
Today’s post is about helping you figure out whether you’ve identified one of these latter cases: a query that was running fine, but is suddenly not doing so well. This can happen due to any number of root causes, the most common of which are outdated statistics and parameter sniffing. Actually fixing the problem is beyond the scope of this post, but identifying it is half the battle.
Who is Active’s @get_avg_time option is designed for this scenario. When enabled, a new column is added to the Who is Active output, [dd hh:mm:ss.mss (avg)]. This column reflects the average run time—if it’s available—of the statement that your request is currently working on. The idea is that if you’re not sure whether the current run time is an anomaly, you can compare it to how long previous runs have taken to figure out whether you have reason for concern. Note that this is only useful if the server happens to have cached statistics about prior runs. If the cache has been cleared recently due to a reboot, memory pressure, or someone changing the server configuration options, all bets are off.
To see the feature in action, we can simulate a basic parameter skew issue. Set up the following stored procedure in AdventureWorks:
CREATE PROC #sniff_test
FROM Production.Product AS p
LEFT OUTER JOIN Production.TransactionHistory AS th ON
th.ProductID = p.ProductID
LEFT OUTER JOIN Sales.SalesOrderDetail AS sod ON
sod.ProductID = p.ProductID
p.ProductID BETWEEN @min_ProductID AND @max_ProductID
Now exercise the procedure a few times to get a baseline:
@min_ProductID = 1,
@max_ProductID = 700
Once the baseline has been established you’ll have a nice set of statistics recorded in the cache. Changing the arguments ever-so-slightly will have an enormous impact on the run time:
@min_ProductID = 1,
@max_ProductID = 800
While that’s churning away, bring up a new window and run Who is Active using the @get_avg_time option.
@get_avg_time = 1
The result will be something like the following:
From these results we can see that when I ran Who is Active the procedure had already been running for over two seconds. But the average time is a mere 59 milliseconds. Clearly, something is amiss. (And indeed it is: both an inappropriate plan and a ProductID range with a lot more information than the prior range that was used.)
Based on the number of questions I’ve received—or, more accurately, the complete silence—the @get_avg_time feature hasn’t gotten a lot of attention from many Who is Active users. Admittedly, while I have found it to be useful on a few occasions, I don’t use it much myself. And it is somewhat slow when run on a system with a large number of cached plans. But I do think it’s a worthwhile feature. If you’ve used it, please let me know in the comments below. I would very much like to hear about whether it helped you out, was completely worthless, or anything in-between.