SQLblog.com - The SQL Server Blog Spot on the Web

Rob Farley

- Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and provides consulting and training courses around the world in SQL Server and BI topics.

  • Join Effects With UPDATE

    A lot of people don’t like UPDATE with a FROM clause. I’m going to explore what’s going on, both logically and within the query plan. And I’m doing this for T-SQL Tuesday #74, hosted by Robert Davis (@sqlsoldier).

    I’m going to use a fresh copy of AdventureWorks2012. I’m using SQL 2012 SP3, but the things I’m looking at should apply to most versions. I’m going to use Production.Product and Production.ProductSubcategory. The Product table has a ProductSubcategoryID column with a foreign key in place, although this column allows nulls, as not every product belongs to a subcategory.

    Our standard UPDATE query doesn’t have a FROM clause. It’s just “UPDATE ... SET ... WHERE ...”, and if we need to hook into other tables, we use sub-queries. Let’s look at why this is significant.

    The WHERE clause filters rows. A sub-query in the WHERE clause still only filters rows, either by being a scalar expression used in one of the predicates, or being a single-column table expression used in an IN predicate, or a table expression used in an EXISTS clause. Any other tables used in sub-queries in the WHERE clause can only be used to help filter the table being updated – they can’t affect the SET clause at all, or cause a row to be updated multiple times.

    Here are some examples:

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    WHERE ProductSubcategoryID = (SELECT s.ProductSubcategoryID FROM Production.ProductSubcategory s WHERE s.Name = 'Tights');

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    WHERE ProductSubcategoryID IN (SELECT s.ProductSubcategoryID FROM Production.ProductSubcategory s WHERE s.ProductCategoryID = 4);

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    WHERE EXISTS (SELECT * FROM Production.ProductSubcategory s 
                  WHERE s.ProductCategoryID = 4 
                  AND s.ProductSubcategoryID = Production.Product.ProductSubcategoryID);

    Using other tables in the SET clause generally means something that returns a scalar value, although this could become more complex using CASE. Still though, the logical impact on the overall query is notable. Something in the SET clause cannot filter the rows being updated, or cause a row to be updated multiple times. The SET clause is a list of “column = <scalar expression>” clauses, in which each column must come from the table (or table expression) being updated (which means I don’t think it should ever have a table alias), and cannot be listed multiple times. As the expression is scalar, it can’t produce multiple rows or columns.

    UPDATE Production.Product
    SET DiscontinuedDate = (SELECT MAX(s.ModifiedDate)
                            FROM Production.ProductSubcategory s 
                            WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID)
    WHERE ProductSubcategoryID IN (SELECT s.ProductSubcategoryID FROM Production.ProductSubcategory s WHERE s.ProductCategoryID = 4);

    Now, all UPDATE statements could be written like this. As an UPDATE statement cannot change the number of rows in a table, the net effect on any given row is at most a single change (I know triggers could be used to have a larger effect, but that’s a separate topic). I’ve met plenty of people over the years who will argue for never using a FROM clause in an UPDATE statement.

    You see, a FROM clause can have more of an effect than these sub-queries.

    Let’s think about what introducing extra tables via a FROM clause can do. For now, let’s start with what’s going on in a SELECT query, when you turn a FROM clause into a FROM ... JOIN clause.

    1. It can let you access data in the columns of those tables, to use in predicates or expressions.

    2. It can let you filter the data in the base table, by only allowing rows which match, such as when using an inner join or right outer join.

    3. It can cause rows in the base table to be returned multiple times, if multiple rows in the joined table match a single row in the base table.

    4. It can introduce NULL rows, if a full or right outer join is being done (or a left outer join with the base table second) and there are rows in the joined table that don’t match any rows in the base table.

    The particular combination of these will affect the type of join performed by your SELECT query – such as a Semi Join which does number 2, but none of the others. And if it does none, then the join is redundant and won’t appear in the query plan at all.

    So how does this work in an UPDATE statement?

    There are two possible ways that a FROM clause can work – one is to include the table being updated in the FROM clause, and the other is to have it NOT included in the FROM clause. If it doesn’t appear in the FROM clause, then predicates to define the matching criteria must be included in the WHERE clause to avoid updating every row. If it does appear in the FROM clause, then I would recommend using the table alias in the UPDATE clause rather than the name of the table itself.

    (Interestingly PDW does not support “UPDATE ... FROM ... JOIN”, although “UPDATE ... FROM ... WHERE” is fine.)

    So this is fine:

    UPDATE Production.Product
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.ProductSubcategory s
    WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID;

    As is this:

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    But please be careful about:

    UPDATE Production.Product
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    It works, but I don’t consider it safe. Because you have the potential to update a table which isn’t mentioned in the FROM clause, you could find yourself inadvertently updating every row in Production.Product. There are safeguards to prevent it happening – this next example gives an error:

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    FROM Production.Product p
    WHERE Production.Product.ProductSubcategoryID IS NOT NULL;

    This one, however, doesn’t give an error, and updates every row in the table – after all, we have a CROSS JOIN going on, because I’ve listed the wrong table:

    UPDATE Production.Product
    SET DiscontinuedDate = SYSDATETIME()
    FROM Production.ProductSubcategory p
    WHERE p.ProductSubcategoryID IS NOT NULL;

    If I’m writing queries, it’s generally fine. But if there’s a system which produces dynamic SQL, I start to worry. I’d rather update the alias, and be completely clear about what’s going on.

    So let’s go with the idea of using the table alias in the UPDATE clause when using the FROM clause, and choosing to always include the table being updated in the FROM clause. Unless we’re using PDW, of course.

    But the impact of those join effects... let’s look at them.

    Earlier, we saw this query. An inner join between Product and ProductSubcategory.

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    Now, because s.ProductSubcategoryID is known to be unique (it’s the primary key on s), there is no way that this can cause ‘multiple updates’ to Product. Things are okay here, but filtering could certainly apply. A join is done to get the values from ProductSubcategory, and the rows are fed into the Clustered Index Update operator.

    [Query plan: join feeding the Clustered Index Update]

    Filters are okay here. UPDATE is happy with filters, whether they’re implemented using the WHERE clause or via an ON clause.

    But what if the unique index weren’t there? Then we might see duplicate rows – the next join effect.

    ALTER TABLE [Production].[Product] DROP CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID];
    ALTER TABLE [Production].[ProductSubcategory] DROP CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID];

    Now what does the plan look like? We should see a Table Scan instead of a Clustered Index Scan, because we just dropped the PK – but what other differences are there?

    [Query plan: Table Scan with a Distinct Sort before the update]

    It looks very similar, but now throws a Distinct Sort in there. You see, an Update isn’t going to do multiple updates. It won’t allow it. So it does a Distinct Sort on the PK of the Product table, and uses whichever value it cares to for the update.

    Another option it could’ve used would’ve been to use an Aggregate operator (because GROUP BY and DISTINCT are essentially the same thing), in which case it would’ve needed to apply an aggregate function to s.ModifiedDate while grouping by the Product PK. Which aggregate? The ANY() aggregate, of course – because it doesn’t care which value to use, it just has to be a valid one. I can get this plan by using an OPTION (FAST 1) query hint, because that will avoid doing the Sort, as a Sort is blocking. It also turns the Hash Match into a Nested Loop, because it really wants to get that first row through as quickly as possible. It’s a slower query, but lets us see the ANY() aggregate.
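
    (For reference, this is the hinted query being described here – the same UPDATE as before, with just the hint added. It’s a sketch to reproduce the plan shape; your costs and timings will vary.)

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID
    OPTION (FAST 1);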

    [Query plan: Nested Loop with a Stream Aggregate using the ANY() aggregate]

    So we can see that if multiple rows are going to be returned by the FROM clause, this will get shrunk down to a single one. This is how that third ‘join effect’ is handled.

    Be really careful about this. It’s a bad thing, and the reason why purists don’t like to see a FROM clause in an UPDATE statement.

    The next (and final – yay!) join effect is to have NULLs introduced.

    Let’s start by putting those constraints back in:

    ALTER TABLE [Production].[ProductSubcategory] ADD  CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID] PRIMARY KEY CLUSTERED
    ([ProductSubcategoryID] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY];
    ALTER TABLE [Production].[Product]  WITH CHECK ADD  CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID] FOREIGN KEY([ProductSubcategoryID])
    REFERENCES [Production].[ProductSubcategory] ([ProductSubcategoryID]);
    ALTER TABLE [Production].[Product] CHECK CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID];

    ...and let’s put a RIGHT JOIN in there (for purely academic reasons – I know you’d never do this in real life, although you might put a LEFT JOIN in with the base table second). This means that our FROM clause will return an extra row for each ProductSubcategory that has no Products. There aren’t any of them in AdventureWorks2012, but the Query Optimizer doesn’t know that.

    Before I go any further, let’s quickly make something clear. With a right outer join, the result set of the join contains rows that don't exist in the base table. Obviously we can't update those – there's nothing in the base table for those rows. But we’re going to look at how the query plan handles this situation.

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    RIGHT JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    There are two things of note here in the query plan (apart from the fact that it has put the base table second and used a left join):

    [Query plan: outer join with a Distinct Sort and a Filter]

    For a start, we still see a Distinct Sort! I can assure you that the unique constraint is in there. If I remove the keyword ‘RIGHT’ I go back to my original version without any distinctifying operator. But the thing that’s new here is that Filter. Let’s look at what the Filter is doing:

    [Filter properties: predicate on IsBaseRow1001]

    It’s filtering on “[IsBaseRow1001] IS NOT NULL”. That value is coming from the Product table, and is simply checking to see whether the row coming out of the Join operator is a match or not. It’s testing to see if we actually have a row in the base table to update. It could’ve tested ProductID for NULL for this, like we would in a query, but I guess it’s quicker to test IsBaseRow for being NULL than to test ProductID. I don’t know much about IsBaseRow1001, but I can tell that it’s not more than a single byte. The Estimated Row Size on the Scan of Product was 15B in the original query, and is 16B in this query. But I’m just guessing here. Theoretically it’s not needed at all, of course, and for testing, could have been a single bit.

    Or the Query Optimizer could have turned the join into an inner join. After all, we’re not interested in updating a non-row. As much as it’s interesting to see IsBaseRow1001 coming through, I can’t help but think that turning that join operator into an inner join would’ve done the trick. But as we don’t see LEFT JOIN + NULL being turned into an Anti Semi Join either, I’m not too surprised that this translation isn’t happening either.

    Because there could be multiple Subcategories without Products, there is a possibility of non-unique ProductIDs – the NULLs – coming out of the Join operator. But these are the only ones that could be duplicates, because each Product has at most a single matching ProductSubcategoryID in s. Therefore, once the NULLs have been removed by the Filter, the QO should be able to know that the data coming out of the Filter is unique on ProductID, but it doesn’t use this information, and needs a distinctifying operator to be sure.

    The Distinct Sort is still on ProductID, but a FAST 1 hint turns it into a Hash Aggregate this time instead of a Stream Aggregate. The reason for this is that a Nested Loop over the Product table isn’t going to find the ProductSubcategories that don’t have Products (although it could if it understood the translation to Inner Join). Therefore, it still performs a Hash Aggregate, does the filter for IsBaseRow1001, and then does a Hash Match (Flow Distinct) on ProductID.
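
    (Again, in case you want to follow along, this is the hinted version of the right-join query being discussed – just a sketch, and the plan shape is the interesting part.)

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    RIGHT JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID
    OPTION (FAST 1);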

    It’s interesting to see that we have a Build Residual here on ProductID, despite ProductID being an integer.

    [Hash Match properties: Build Residual on ProductID]

    You see, normally in a Hash Match on an integer we wouldn’t see a residual because the hash function produces an integer. It’s because ProductID could have been NULL. The nullability of the column coming through obviously wasn’t changed by the Filter (and let’s face it – it didn’t test ProductID for NULL, it tested IsBaseRow1001).

    Quick interlude for showing that the hash function produces a 32-bit value, and doesn’t need a residual check when hashing on a non-nullable integer (while a nullable integer needs more than 32 bits):

    Compare the plans of these two queries. The Hash Match operator in the first one doesn’t have a Probe Residual, because s.ProductCategoryID doesn’t allow NULLs. The Hash Match operator in the second does have a Probe Residual, because p.ProductSubcategoryID does allow NULLs, and a nullable integer can cause clashes in the hash table.

    SELECT *
    FROM Production.ProductSubcategory s
    JOIN Production.ProductCategory c ON c.ProductCategoryID = s.ProductCategoryID
    OPTION (HASH JOIN);

    SELECT *
    FROM Production.Product p
    JOIN Production.ProductSubcategory s ON s.ProductSubcategoryID = p.ProductSubcategoryID
    OPTION (HASH JOIN);

    Also consider the Hash Match operator in the following query:

    CREATE TABLE dbo.rf_BigInts (id BIGINT NOT NULL PRIMARY KEY);

    SELECT *
    FROM dbo.rf_BigInts t1
    JOIN dbo.rf_BigInts t2 ON t2.id = t1.id
    OPTION (HASH JOIN);

    DROP TABLE dbo.rf_BigInts;

    Here we see a Probe Residual on a non-nullable bigint (which needs a lot more than 32 bits). This tells me that bigints can have clashes in the hash table, even though non-nullable integers don’t show this.

    Oh yes, we were looking at the Build Residual.

    If we went back to an Inner Join with FAST 1, where we got a Stream Aggregate, and turn that into a Hash Match on the non-nullable ProductID, we can see that our Build Residual disappears.

    ALTER TABLE [Production].[Product] DROP CONSTRAINT [FK_Product_ProductSubcategory_ProductSubcategoryID];
    ALTER TABLE [Production].[ProductSubcategory] DROP CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID];

    UPDATE p
    SET DiscontinuedDate = s.ModifiedDate
    FROM Production.Product p
    JOIN Production.ProductSubcategory s
    ON s.ProductSubcategoryID = p.ProductSubcategoryID
    OPTION (HASH JOIN, FAST 1);

    [Hash Match properties: no Build Residual]

    Let’s do a bit of a summary...

    If you’re doing an UPDATE, you can only update each row in that table one time, no matter what kind of impact your FROM clause might want to have on the base table. You may feel like it's going to update some rows multiple times, but that's not how it works.

    Each of the join effects is either applied (in the case of a filter) or mitigated (in the case of duplicates or NULLs), so that you can access the data in other tables without fear of having a bad effect on your UPDATE, but don’t do it! Because you can’t tell which value gets picked up by the ANY() aggregate (or first row by a distinctifying operator), you should avoid duplicates completely, so that your UPDATE finds a single value to update each row with.
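
    If the possibility of accidental duplicates worries you, one option (just a sketch – not the only approach) is to fall back to the sub-query style. A scalar sub-query that returns more than one row makes the statement fail with an error, rather than silently keeping whichever value the ANY() aggregate or distinctifying operator happens to pick:

    UPDATE Production.Product
    SET DiscontinuedDate = (SELECT s.ModifiedDate
                            FROM Production.ProductSubcategory s
                            WHERE s.ProductSubcategoryID = Production.Product.ProductSubcategoryID)
    WHERE ProductSubcategoryID IS NOT NULL;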

    And I would encourage you to use table aliases in your UPDATE clause if you use a FROM clause with a JOIN – but if you don’t use a JOIN, then make sure you include the match in your WHERE clause.

    @rob_farley

  • A new superpower for SQL query tuners – Number of Rows Read

    There’s a technique that I’ve been using for some time now (and had been looking for even longer), which has now become available to everyone who can see a query plan (version permitting).

    Some years back (ok, it was 2010), I started to present at conferences about SARGability and residual predicates. I had a session at SQLBits VII about it (one of the first times I’d presented with SQL MVPs from at least five different countries in the room), I presented at the 2011 PASS Summit on “The evils of residualiciousness”, and I even wrote a song with the line “my predicate’s residual, my seek just runs too slow”. I wrote blog posts about bad covering indexes, and TSA security probes (or something like that).

    The point of all this is that an index seek, merge join, or hash match is not necessarily effective for quickly locating the rows that you care about. It all comes down to the predicates that are involved, and whether they are SARGable for the index you’re trying to use.

    Over at my “Covering Schmuvvering” post, I describe an index and query on AdventureWorks like this:

    CREATE INDEX rf_ix_Covering ON Production.Product(DaysToManufacture)
    INCLUDE (Name, ProductNumber, Size, ReorderPoint, Color);

    SELECT Name, ProductNumber
    FROM Production.Product
    WHERE DaysToManufacture < 4
    AND ReorderPoint < 100
    AND Color = 'Red';

    The plan gives me an Index Seek that returns a single row. Woohoo!

    [Query plan: Index Seek returning a single row]

    ...but I explain how this is actually really bad, because all the work is being done in the Predicate, not the Seek Predicate.

    [Index Seek properties: Seek Predicate and residual Predicate]

    You see, the “Predicate” is the Residual Predicate (my term – you’ll just see it as “Predicate” here, because it’s the same kind of thing that we see in a Scan, which doesn’t have a concept of the Seek Predicate and the Residual Predicate).

    The Residual Predicate is the leftover one, that needs to be checked after the Seek Predicate. You might not have one, if every predicate in your query is handled already by the time the Seek is done. But if you do have one, there is extra checking to do.

    So just like how a Scan will start on the first page of the index and keep running until it’s either reached the end or it doesn’t need to keep looking (because the operator on its left has stopped asking for more rows), a Seek will find the rows that satisfy the Seek Predicate, and then have to check each one to see if it satisfies any leftover predicates.

    The same applies for Merge Joins, which find things that match using the sort order of each data stream, but then need to apply any residual predicates. Hash Matches can be even worse, as the Probe Residual will include the Probe Keys if the hash function could cause two different values to be assigned to the same bucket. Read more on probe residuals here.

    Back to Index Seeks though...

    When the Query Optimizer creates a plan that has a residual predicate in a Seek or Scan, one of the earlier iterations will have seen this predicate applied in a Filter operator to the left of the Index operation. But before the plan is created, the residual predicate is pushed down into the index operation.

    If we apply trace flag 9130 (undocumented, so be wary), we can see the plan without the pushdown.
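
    (If you want to try this yourself, the flag can be applied per-query via the QUERYTRACEON hint, which typically needs sysadmin rights – and again, it’s undocumented, so treat this as a sketch for a test box only.)

    SELECT Name, ProductNumber
    FROM Production.Product
    WHERE DaysToManufacture < 4
    AND ReorderPoint < 100
    AND Color = 'Red'
    OPTION (QUERYTRACEON 9130);  -- undocumented trace flag: shows the residual as a separate Filter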

    [Query plan with trace flag 9130: Filter applied after the Seek, which produces 407 rows]

    And that thick arrow there shows us 407 rows being produced by the Seek operation, despite the single row being produced in our original plan. That original plan did not contain that information.

    But with Service Pack 3 of SQL Server 2012, all that changes.

    From now on, Index Seeks and Index Scans have an additional property called “Actual Rows Read”, which shows through in SSMS as “Number of Rows Read”.

    [Index Seek properties: Number of Rows Read]

    Please note that you need to have a recent-enough version of SSMS to see this come through. If you are running SSMS 2014, you’ll need to wait for the next service pack. Or if you’re running SSMS 2012 or SSMS 2016, make sure you’ve applied the latest service pack or CTP.

    This feature is amazing though! A significant part of the query tuning I do is to look for ineffective Seek operations. Many people look for Scans, but don’t consider that a poor Seek is just as bad. Now, we can easily see that this Seek took 407 rows to produce just 1, and a different indexing strategy could be recommended. It wouldn’t surprise me to see warnings start to come through in 3rd party products like SQL Sentry’s Plan Explorer soon as well, because I think that a Seek with 0.25% effectiveness is worth a warning. And without having to use the trace flag either!
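
    As a sketch of what a different strategy might look like here (the index name is made up, and the right choice always depends on your wider workload): putting the equality predicate first in the key lets the Seek Predicate do far more of the work, leaving much less to be checked residually.

    CREATE INDEX rf_ix_CoveringBetter ON Production.Product(Color, DaysToManufacture)
    INCLUDE (Name, ProductNumber, ReorderPoint);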

    It’s like you’ve just been given X-ray vision on the Index operators!

    Other experts in the space understand the significance of the residual predicate, such as Kendra Little (@Kendra_Little), and now you can see this too.

    There are a few more things I would like to see to round this feature off. I’d like to see Estimated Rows Read (which I can see using 9130), and I’d like to see similar information for Merge Joins and Hash Matches. If you want to see these implemented, jump onto Connect and vote for these items: Estimated Rows Read and Merge Join / Hash Match info – and one to fix up the naming convention.

    ...and in the meantime, make sure your local copy of SSMS is up-to-date, and encourage people to upgrade their SQL 2012 boxes to SP3!

    @rob_farley

  • Does the paradigm fit the environment?

    This month we see the 73rd T-SQL Tuesday come around – six years were completed last month (the first was in December 2009), and this is the start of year seven. It’s hosted this month by Bradley Ball (@sqlballs), and he asks a question about whether our environments are naughty or nice.

    Now, I’m a consultant, and deal with a variety of customers. And I don’t talk about those customers. It’s a thing. I don’t ever want to have a tweet or post where I say “Seriously? I can’t believe my customer has a setup like this!” – because my customers wouldn’t want me to say those things in public, even if I’ve kept the names private.

    Something that I see from time to time though, which really affects the ‘niceness’ of an environment is the paradigm that was used to solve the particular problem.

    20-something years ago, when I was at university, I did a lot of subjects that haven’t obviously impacted my career. Most of them were interesting and fairly useful (although I haven’t really been able to apply the particular nuances of using matrices to rotate the camera in a 3D-modelling environment), but one that really stands out for me as being particularly useful was a subject on Programming Paradigms. We explored functional programming, logic programming, and a few others. I can’t actually tell you the full list of paradigms we explored – I didn’t even attend most of the classes (I had stuff going on which meant I ended up only scraping through – the lowest scoring semester of my whole degree). But the impact that it had on me was an appreciation that the problems we face today shouldn’t necessarily be approached with the same hat that we wore yesterday.

    In the database space, we use a set-based programming paradigm. We apply relational theory to a schema design, and then write queries using set-based logic. This is a useful approach, but it can go too far. When you’re writing queries that you want to perform in particular ways, the focus could be something different. Perhaps you want to create a cursor, looping through each row of a resultset and doing some amount of processing on it. Iterative code, within a set-based environment. It’s a different paradigm, and can turn a nice system into a naughty one, or perhaps even turn a naughty system into a nice one.

    Even within the database space, we have different paradigms to apply. I see data warehouses that try to stick to a normalised design like the underlying transactional environment. I see data warehouses that demand a purely star-schema design. I see parallel systems that haven’t considered distribution theory, and parallel systems which have pushed distribution theory to the nth degree. I see indexing strategies which help, and indexing strategies which don’t.

    Usually, this comes down to the paradigm being applied. It’s generally not too hard to spot when the wrong paradigm has been used, or when a particular paradigm has been pushed too far, but it’s not always easy to quantify and measure empirically. My perspective is that the people involved should feel like things make sense. When sufficiently educated people (people who ask questions rather than blindly accept what they are told) are comfortable with the design decisions, it’s generally not a problem. When they find themselves trying to figure out what’s going on, and why a particular approach to a query has been taken, then there’s an issue. And I don’t care whether that’s a problem with a T-SQL query, or an MDX query, or a piece of ETL – I simply find that if there are experts in the place who wince a little when describing why something is the way it is, then that’s a sign that things aren’t quite right.

    Now, I’ll happily help fight battles to get these things fixed. But as a consultant, I know there are battles worth fighting, and situations worth accepting. And I know that success can be achieved despite things which are less than ideal. But when I think about whether a particular environment is worth getting a lump of coal or a nice elf-created gift, then I often look at the paradigm that was used when the system was designed. Then at least, things will make more sense.

    I hope you all have had a terrific 2015. Why not decide to write a few T-SQL Tuesday posts yourself in 2016?

    @rob_farley 

  • Anti-Harassment Policies and Codes of Conduct

    I was a director on the PASS board back in 2012 when our having a Code of Conduct was first raised. A number of conferences had experienced bad situations, particularly around sexist behaviour, and it was very appropriate for PASS to take a stand and say “We don’t want this kind of thing to happen to PASS members.”

    We ALL wanted to make sure that the PASS community was a safe community – one which people could be part of without having to worry about whether there would be an “incident”. No one wanted the PASS Summit, or a SQL Saturday, or any PASS-related event, to incur an “incident”. We considered that the only acceptable number of incidents was zero.

    That said, there was a certain amount of awkwardness – particularly in the days leading up to the official discussion about the proposed Code of Conduct. There was a genuine fear about how a Code of Conduct would affect the tone of PASS events. Nobody wanted to be removed from an event because of a seemingly innocuous comment, but even more, no one wanted there to be an incident of harassment. And this fear expressed itself in awkwardness, bordering on flippancy.

    As the globalisation committee (a subsection of the board including some advisors – all of whom knew about the proposed Code) sat around to discuss globalisation, the first time there was a double-entendre, instead of raising an eyebrow or saying “Oh really?” or something else, the expression of the day was “There’s a Code of Conduct violation right there...”. It was a reflection of the nervousness that people felt around what the impact would be. People wanted to maintain the informal atmosphere of the meeting, but didn’t know how to react to a double-entendre in light of the future Code of Conduct – remembering that we ALL wanted PASS to become a safer community for our members.

    We don’t tolerate harassment at all. But at what point do things become harassment? At first it felt like we were trying to define it.

    As an Australian, I see a certain amount of banter about New Zealanders. It goes both ways, and the jokes are apparently very similar. They joke that we treat our sheep in particular ways, and we say the same about them. In the 1980s, the Kiwi Prime Minister Robert Muldoon said that New Zealanders moving to Australia raised the average IQ of both countries, which I think is a fantastic sledge! To suggest that people leaving New Zealand must be less smart than the average Kiwi, but still smarter than the average Australian, is a beautifully crafted rib. Is it racist? By definition, perhaps – but I doubt anyone felt vilified by it.

    “By definition, perhaps” was the phrase that worried me.

    I knew that if we defined the Code of Conduct wrongly, then I, and many others, could easily be in breach of it. I knew that if I reacted to a double-entendre with a raised eyebrow, that could be seen as sexualising a situation. I knew that if I joked that a Tottenham fan in the room was of lower intelligence than an Arsenal fan, then that could be seen as harassment. Maybe not by the Spurs fan, but by someone else watching, who might think that I genuinely insulted them. Even to suggest that a developer has no respect of data (as many PASS presenters might suggest in a session) could be seen as unfairly assigning undesirable attributes to people. It was a concern.

    So instead of raising an eyebrow, instead of reacting to any situation in my usual way, I reacted with “There’s a Code of Conduct violation right there...”. It still achieved the joke, but in a way that acknowledged my fears of what the Code might imply. It wasn’t appropriate, and I’m sorry. The correct thing to do would have been to have just bitten my tongue and ignore it. I also wasn’t the only one in that situation – I think just about everyone in the room did the same.

    We all wanted a policy, but we didn’t know how it was going to affect us.

    As we discussed it, we were able to work out that really what we wanted was not a Code of Conduct that defined what we allowed and what we didn’t allow, because we would never have been able to get that right. What we wanted was to adopt a stance that said “We do not tolerate harassment”, and to have a procedure for what happens if someone feels harassed. What we wanted was an Anti-Harassment Policy.

    Let me express that again:

    We do not tolerate harassment.

    And I don’t want to define what harassment means for an individual. I don’t want to define that certain types of touching are okay and others are not. I don’t want to define that particular types of eye contact count as harassment. I don’t want to define the words that can be used to describe body parts (like if someone falls and says they’ve hurt their backside – do they need to be careful about the word they use?), or what counts as “acceptable swearing” at a party. If we define this, then we run the risk that someone might go right up to the defined line in harassing someone, but we haven’t provided a course of action for the victim because the harasser hasn’t broken the “Code of Conduct”.

    I do want to have well-documented processes for how to react if someone feels harassed, because I want the person who feels harassed to know they have a course of action open to them.

    I think a Code of Conduct should be around expected behaviour in particular situations. A Code of Conduct says that a session presenter should wear a collared shirt not a T-shirt. A Code of Conduct says that a sponsor should respect the geographic boundaries of other vendors’ booths. A Code of Conduct shouldn’t say “You must not use someone’s nationality as the subject of a joke” – because when Australia was beaten in the final of the Rugby World Cup, that’s an opportunity to rib them about it, but the principle of standing against racism is incredibly valid. If I suggest that Americans are stupid for considering that “could care less” means the same as “could not care less” – am I crossing the line? It probably depends on a lot of other factors.

    Let me say it again:

    I do not tolerate harassment.

    I simply recognise that what some people see as harassment, others see as friendly banter. Should Bradley Ball, Joe Sack, and Buck Woody be offended about jokes regarding their names? I don’t know. That’s entirely up to them in the situation, and the context of what’s said. Sometimes they might be fine with it, other times they might not. That’s their right. No one else gets to dictate their reaction. Should Kevin Kline have been upset that I sang Happy Birthday to him loudly, in public situations, repeatedly, for a whole day? I try to monitor those situations, and back off if they seem to be getting upset. Is my detector of people’s personal lines sometimes faulty? Sadly, yes.

    I do not tolerate my own harassment of others.

    If you have ever felt harassed by me, I’m personally sorry and honestly regret it. I know I joke. I know I often joke at other people’s expense. But I never mean to harass.

    My personal Code of Conduct varies according to the company that I’m keeping – there are times that it’s okay to point out a double-entendre, but a job interview is probably not that time. My personal Anti-Harassment Policy is not variable. I don’t tolerate harassment, and if you ever feel harassed by me, tell me. If I don’t stop (though I hopefully always do stop), then tell me again, or tell a friend of mine and get them to help me stop (because I have probably misinterpreted you – if I say ‘Oi’ to someone who calls me fat, that doesn’t necessarily mean I’m feeling harassed, even though my extra kilos bother me and I really don’t like it being pointed out).

    PASS has an Anti-Harassment Policy. As the SQL community, we don’t tolerate it, and we know what to do if someone feels harassed.

    Defining harassment is tough – it’s subjective, and individual. Making a stance to say “we don’t tolerate it” and “if you harass someone, here’s how we will respond” is a good thing.

    Let me say that again:

    The PASS community doesn’t tolerate harassment.

    @rob_farley

  • What’s driving your data model?

    If data modelling were easier, I doubt there would be as many books on the subject, and we wouldn’t have multiple methodologies to consider.

    I’m not going to explore the different methodologies here – that’s almost a religious argument these days, and I am more than happy to let you adopt whichever method you like. Instead, I want to challenge you to think about what is driving your design, and what makes you consider whether it suits your business or not.

    Time and time again I see companies that use software to help them run their business. Sometimes this is an off-the-shelf system or a cloud-based solution; sometimes it’s a bespoke system built by software developers. I’m definitely in favour of using software, and wonder how people operate without it these days.

    ...but how much is your business driven by the software? I see a lot of businesses being led by their software, rather than having the software adapt to the business. For the most part, I’m fine with either. There is a lot to be gained by using systems developed by similar businesses, and taking advantage of lessons learned by others. Letting that software help guide internal processes can be very useful.

    But I don’t think that applies to data models – you should at least consider how much it does.

    I don’t like to write about specific customer situations, so I’m not going to describe a particular anecdote in great detail here. But I want to say that I frequently see environments where the models used within data warehouses don’t describe the business that’s going on – they describe the software that’s used.

    Many data professionals look at a data warehouse as a platform for reporting, built according to the available data sources. I disagree with this.

    The models within a data warehouse should describe the business. If it doesn’t, it’s a data model gone wrong.

    What is the central thing that your business does? What is the main interest point? What do you need to look after? For me, this forms the core of the warehouse.

    The core of the warehouse is not necessarily the main fact table, but could be one of the main dimensions. If you’re a store, do you care about sales, or do you care about customers? The difference is subtle, but might drive some important design elements.

    Two clothing stores might use the same back-end systems for their point-of-sales systems, and might have the same loyalty system set up to persuade people to keep coming back. But one store might have a focus of getting customers back, driving brand loyalty which leads to dedicated fans and word-of-mouth sales. The other store might be more about piquing interest from people walking past the door, and trying to get them to come in and pick up a bargain. Of course, there will be an element of both in both stores, but the culture amongst the staff will be slightly different, as the first tries to identify the customer, tries to make sure that the customer feels cared for, and tries to form a relationship with the customer. It’s less important that the customer buys something, so long as they are going to return. The second sees the customer as a way to get a sale, while the first sees the sale (or even the lack of a sale!) as a way to get a customer. I’m sure you can think of stores in each category.

    It would be very easy to create the same data warehouse for both stores, using a standard retail environment. But are the needs of the stores adequately met?

    There is no doubt that both stores need sales to stay afloat – the retail business requires it. But if your business culture has slightly different concerns to the industry standard, then the data model should cater for that. Perhaps you need a way of scoring customer loyalty, and some path analysis to see what helps a customer reach a particular level of engagement. Perhaps you need to start collecting extra data. Maybe the stores could consider awarding points for simply visiting the store, even if no sales are actually made. Is the person who works from a cafe and buys just one cup of coffee all morning good for business, or bad for business? Can your data model help explore this, or are you designing a system which only handles the data in your transactional system?

    I like to come back to the description of a data warehouse being the single source of truth for an organisation. Many people consider this an issue for data quality – that once data is in the warehouse, it’s trusted and can be used for business analytics. But it should go beyond that. The data warehouse should have transformed the data as kept by the various software packages into data which describes the business, becoming the source of truth about the business. The reports and dashboards across this data should help identify the culture of the organisation, by highlighting its values and ideals.

    The starting point for a data warehouse design should not be “What are the facts we need to measure?” but rather “What are we about as a business?” – often similar, but occasionally not. Ask what success looks like and what questions will address that.

    Don’t ignore the bottom line, but also don’t ignore what's really important to the business.

    @rob_farley 

    This post was prompted by the seventy-second monthly T-SQL Tuesday, hosted this month by Mickey Stuewe (@sqlmickey).

  • PASS Summit 2015 WiT lunch

    A regular spot at the PASS Summit is the Women in Technology lunch. This year is no different.

    A few years ago, I was on a discussion panel at the lunch. The last couple of years, though, the format has changed to an interview with a champion for WiT. This year, Angie Chang is being interviewed about an initiative called HackBright, which helps women form careers in IT. Angie has also been involved with Girl Geek Dinners.

    HackBright has classes which are women-only. This is terrific, and men in technology need to understand how important this is.

    My daughter goes to a girls’ school. She can learn there without any prejudice about which activities are suited to boys, and which are more suited to girls. So she learns how to program a robot, she plays cricket, as well as gymnastics, dance, and choir. She sees no differentiation between these things, and will be able to discover skills that she might not have developed if she’d had to compete against boys. I don’t play cricket myself, but I know that in co-ed schools, it’s only really the boys that play cricket or program computers.

    Angie is talking about scholarships that are available through HackBright, sponsored by some of the leading employers in the IT space. Clearly there are companies who have realised the value of technical women, and who want to ensure that their companies are welcoming to women. They talk about increasing the number of women who are getting into IT, but also about understanding that if organisations don't provide cultures that encourage them to stay and develop their careers there, then the efforts of companies like HackBright get wasted, and the IT industry doesn't improve.

    A question has just come in from a guy who has a 16yo daughter in the San Francisco area, and he’s asked if his daughter is welcome to go to the Girl Geek Dinners.

    Naturally, the answer is yes. :)

    @rob_farley

  • PASS Summit 2015 - Keynote 2

    The second day keynote is always a highlight of mine. Until a few years ago there were three keynotes, with the third day including a session from Microsoft Research. Recently this has changed, and the third day keynote rolls into the second. Today I’m expecting some new announcements, some updates on how PASS is tracking, an acknowledgement of the outgoing president, and more. Plus the Microsoft Research session. And again I’m blogging as we go. Also, it’s SQLKilt Day, in support of the PASS Women in IT initiatives to encourage women to consider IT as a career option and to support women in the IT community.

    The information about PASS finances, membership numbers, and reach continues to impress me. I used to be on the board, and was involved in some of the globalisation initiatives. I love that PASS is so much more than North America. While I would love it to be easier to get to the US from Australia, I have come across quite a lot of people from other countries (even quite a lot in Australia). PASS has committed to hosting the Summit in Seattle until at least 2019, which I’m very pleased about, and the 2016 Summit will be in the week of October 24-28.

    Lance Harra (@sqlfarmer) has won the PASSion award. I’m really not surprised at this – over the years I’ve been involved with PASS, I’ve seen his name as one of the regulars.

    Before I know it, Dr Rimma Nehme from Microsoft Research has taken the stage, and she’s speaking about the Internet of Things. Dr David DeWitt is doing part of the talk, and these two complement each other really well. David is a regular at PASS keynotes, while Rimma did her first one last year. I’m not going to try to describe all the points that they’ve covered – but if you are interested in IoT, this is a great talk that covers IoT from basics up to some of the issues they are seeing with it. Even people experienced with IoT will get something out of this, and it’s explained in a nice simple way, as we’ve come to expect from these two.

    Later today I have the Women in IT lunch to attend, another booth session and another main session to deliver, as well as catching up with a lot more people.

    @rob_farley

  • PASS Summit 2015 Keynote 1

    I’m back at the PASS Summit. Another year – my sixth now. And I’m sitting at the bloggers’ table, next to Mark Broadbent (@retracement).

    The PASS Summit is by far the best SQL Server event in the world each year – even better than the Adelaide SQL Server User Group sessions, and the first keynote is always full of announcements.

    It always starts with a bunch of information about how the reach of the SQL community has grown over the years, and it’s an amazing thing to see the impact that the community has these days. Tom LaRock has given the announcements about this, as the current PASS President.

    Joseph Sirosh is up now. He’s the newly appointed Corporate VP for the Data Group. He’s speaking to the changing face of data. I feel like this is a regular spot at keynotes – data has been changing so fast for years now, and the things that people are using data for become more and more impressive. Last year it was about analysing shopping patterns for people who move through a department store – this year it’s looking at huge quantities of medical information to predict current and future medical conditions. This lets people save lives with data now, because early intervention becomes even more possible.

    Eric Fleischman is Chief Architect and VP of Platform Engineering at DocuSign. He’s talking about how SQL Server was the right fit for them, and this fits in with what I hear at customers too. Microsoft is the leader in both Ability to Execute and Completeness of Vision in the latest Magic Quadrant by Gartner (as has just been pointed out by Joseph), and people are finding that SQL Server continues to become the sensible choice for even the largest of organisations.

    Shawn Bice up now – General Manager of the Database Systems Group. He’s showing the new features of SQL Server 2016, and pointing out it’s all built-in. It’s not about Add-Ons – everything is part of the product. He talks about how it’s the leader in Mission Critical, the least vulnerable, the highest performing, cheapest cost for BI, and the Advanced Analytics that has come with R’s integration within the platform. R is the biggest language amongst data scientists, whether they are solving problems in space, or marketing, or wherever. To make SQL Server the most significant data platform for data scientists, this R integration is critical.

    HA and DR have been improved with better algorithms for data transfer. I see this as incredibly important. Every improvement in compression and parallelism is an improvement in moving data around, whether between servers within an on-prem system, a hybrid system, or pure Azure.

    PolyBase comes into SQL Server 2016 to let people use T-SQL over Hadoop. I’m used to having PolyBase through APS, and the idea of being able to hook into Hadoop data stores from “regular SQL” provides numerous opportunities. This is going to lower the barrier for people who want to leverage Hadoop into their current environments. This is really exciting.

    The columnstore improvements in SQL 2014 meant that columnstore data could be updated, but in SQL 2016 we get updateable non-clustered columnstore indexes. For people who redesigned tables to leverage columnstore, or rather, saw changes that they would want to make to leverage columnstore and decided against columnstore – these people can now put an updateable columnstore index on a subset of the columns in a table, and leverage the technology much more easily. Again, lowering the barrier.
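
    (By way of illustration only – this table and column list are hypothetical, not something from the keynote – the syntax is as simple as:)

    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_SalesAnalytics
    ON dbo.Sales (SaleDate, StoreID, ProductID, Quantity, Amount);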

    Rohan Kumar, a Partner Director in Engineering, is showing the impact of this, with a live dashboard, using a non-clustered columnstore index to explore data. He’s looking at how the R integration and the columnstore improvements provide a platform to discover anomalies in data in much quicker time than ever. I know that fraud analysis happens in close-to-real time within banks, but these changes make this kind of work available to many more organisations.

    Sadly, Rohan then opens Profiler to show what’s happening behind the scenes with AlwaysEncrypted. But it’s Profiler, and everyone has been trying to move off Profiler for some years now. AlwaysEncrypted is impressive, but Profiler???

    Stretched Databases make up the last main demo, and the keynote wraps up. The mood in the place is that these are exciting times.

    @rob_farley

  • PASS Summit 2015 presentations I’m giving

    With only a few days to go until people arrive in Seattle, I should probably explain what my sessions are going to be on. You know, in case you hadn’t thought to go to the PASS site and read for yourself (and for those who want to hear something that’s a little less ‘abstract’y). A few people told me last year they were disappointed I wasn’t presenting, so if you’re in that situation, maybe this post is of interest.

    This year, I’m giving three different presentations – two regular Summit sessions that are on the schedule, and a 20-minute spot that I’m doing (twice!) at the Microsoft booth (Wednesday at 1:45pm and Thursday at 1:15pm).

    The 20-minute spot is about techniques you can use to avoid data movement in MPP systems such as PDW or SQL DW. It focuses on Query Optimizer things that I do in regular SQL environments, leveraging things like join redundancy, contradiction optimisations, and tuning aggregations. It’s going to be quite fast-paced, as we have three significant things to explore, with heavy use of SQL query plans and showing how those techniques apply to MPP distributed plans. I’ve used these methods to make queries run WAY faster in both MPP and non-MPP environments, and people have said things like “I wish I’d known that before my last client” (someone at Microsoft), and “Oh, that’s cool – I’m so going to use that” (someone at a major US-based PDW partner). So yeah – come along. Both times!

    The first main session I’m giving is one of my favourite sessions, and is called “A few of my favourite query plan operators”. This talk is going to focus on four Query Plan operators, but also discuss what’s going on in about six others. Or maybe seven – it depends how you count them, and how puzzled the looks on people’s faces go. We’ll explore what’s happening as your query runs, and why sometimes the counter-intuitive option could work out better. Lots of demos, as you’d expect from one of my presentations. Some live typing, and plenty of hand-waving as I describe why a particular plan shape is really what you’re after – even if the estimated cost might suggest otherwise.

    The new session is one that I wasn’t expecting to give, so that’ll be fun. It’s on “The Power of Composite Indexes”, and already I’m kinda regretting it (no, not really) because Americans say that word differently to how I say it (for me the longest syllable is the “com”, while for Americans it’s the “pos”). But however you say it, the idea is to look at indexes with multiple keys, and look at how powerful that can be. And how you can kill the performance as well. We’ll be looking at issues like sargability v residuality, blocking plan operators (especially for people that came to my first session), partitions, and even helping you use T-SQL to fix one of the most frustrating features about the way that queries run.

    So anyway – those are the sessions that I’m giving at the PASS Summit this year. Hopefully somewhere in all that content there’s something you haven’t heard before.

    @rob_farley

  • SQL Server Auditing - for Business Intelligence when your business is the database

    Over the past decade or so, Business Intelligence has become a big deal. As a data consultant, most of my work would be categorised as being in the BI space. People want to have insight into how their business is operating, and be able to use this to do things better. Data has become one of the biggest influencers in the world today – now that data is available, intuition is generally seen as ‘not good enough’, and people want empirical evidence for making decisions ...at least in my experience.

    And that’s all well and good for people who run businesses. You’re just a DBA. Business Intelligence is something that you support, something that you provide, something which you do – for other people to consume.

    My challenge to you is to become someone who consumes BI as well.

    In fact, just about everyone within your organisation could do better by using the stuff that you provide. And that ‘everyone’ includes YOU.

    As a BI developer, you create reports. Do you track how frequently those reports are run? Do you track how long the reports take to produce? How often are they redefined? What KPIs do you put around that?
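
    If Reporting Services is your platform, for example, much of this is already sitting in the ExecutionLog3 view of the ReportServer catalog – a quick sketch, assuming the default catalog name:

    SELECT ItemPath,
           COUNT(*) AS Executions,
           AVG(DATEDIFF(millisecond, TimeStart, TimeEnd)) AS AvgDurationMS
    FROM ReportServer.dbo.ExecutionLog3
    GROUP BY ItemPath
    ORDER BY Executions DESC;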

    As a production DBA, you ensure that backups have been taken, and run tests to make sure every backup can be restored. Are you tracking how long those backups are taking? Are things taking longer? What are the metrics that suggest things are getting worse? Are you using predictive analytics to warn you a system might go down soon?
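
    The raw numbers for backups are already being kept in msdb – a sketch of the kind of trending query I mean:

    SELECT database_name,
           backup_start_date,
           DATEDIFF(second, backup_start_date, backup_finish_date) AS DurationSec,
           backup_size / 1048576.0 AS BackupSizeMB
    FROM msdb.dbo.backupset
    WHERE type = 'D'  -- full database backups
    ORDER BY backup_start_date DESC;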

    As a helpdesk operator, how are you being measured? I’m guessing that your manager is analysing something about the satisfaction level of the people you help, or the number of tasks you get through in the week, and what kinds of tasks use your skills better... are you consuming that information too?

    As a team leader... well, you get the picture.

    Data is all around us. Not just in the Internet of Things, but in the metadata of the systems that we use.

    If you are a data professional, you might be able to spend a bit of time exploring what’s possible using data that is important to you, like SQL Server Audit data, or Windows Event Logs, or report execution logs. If you can get something working in your development environment, get clearance to put it on an Azure instance or production SQL box – something which is looked after and which is properly licensed. But then, start having conversations about how this kind of approach could help just about everyone in the organisation. Big picture stuff is useful, but everyone has a big picture which is useful for them. Stepping back from the minutiae of the day and making intelligent decisions about tomorrow is not just for senior management, but should apply to self-management as well.

    DBAs – get familiar with SQL Server’s auditing. Explore the posts that are coming out today in the T-SQL Tuesday event hosted this month by Sebastian Meine (@sqlity), and use this as a source for your own Business Intelligence system.
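
    If you haven’t set up an audit before, a minimal sketch looks something like this (the names and file path are placeholders – adjust them for your environment):

    USE master;
    GO
    CREATE SERVER AUDIT Audit_Logins
    TO FILE (FILEPATH = 'C:\AuditLogs\');  -- placeholder path; the folder must exist
    GO
    CREATE SERVER AUDIT SPECIFICATION AuditSpec_Logins
    FOR SERVER AUDIT Audit_Logins
    ADD (FAILED_LOGIN_GROUP),
    ADD (SUCCESSFUL_LOGIN_GROUP)
    WITH (STATE = ON);
    GO
    ALTER SERVER AUDIT Audit_Logins WITH (STATE = ON);
    GO
    -- and later, pull the captured events back out for analysis
    SELECT event_time, action_id, server_principal_name, statement
    FROM sys.fn_get_audit_file('C:\AuditLogs\*.sqlaudit', DEFAULT, DEFAULT);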

    @rob_farley

  • PASS Board elections coming up

    It’s four years since I ran for election to the PASS Board now, for a two year term which ended two years ago.

    The two other people that got elected when I did were Denise McInerney and Adam Jorgensen. Two years ago, Denise and Adam became the two Vice Presidents, and Adam is about to begin a stint as President. For me, I didn’t run a second time. Two years ago, Jen Stirrup, Amy Lewis, and Tim Ford got elected when Rushabh Mehta, Douglas McDowell, and I left the board.

    And so we come around to election time again, and Jen, Amy, and Tim are all about to finish their first two years on the board. Amy is not re-running, but Jen and Tim are, along with two new candidates – Argenis Fernandez and Ryan Adams.

    I’m excited, because PASS can’t lose here. The current board members seem to be doing an excellent job (although I didn’t get to serve alongside either Jen or Tim, I know them both well and am sure they’re doing great), but I also know that Argenis and Ryan will do great.

    Argenis is a passionate guy who will stand for important things. I’ve had many conversations with him, and know him to be wise, caring, and dedicated to seeing good things happen. He raises money for charity, and is not afraid to put his reputation on the line for important things. He is one of my most excellent friends.

    Ryan is also a great guy. He works hard for PASS, and has almost won the PASSion award for his dedication to the community. I haven’t had as many conversations with him as I have the other candidates, but I know that he will work tirelessly for the PASS community – that word ‘tireless’ seems to describe Ryan more than anyone else I can think of in the community, and I am a little jealous of his energy levels.

    I can’t tell you who to vote for (except for Jen – as the only person running for the EMEA seat I expect everyone to vote for her), but do take the time to get to know the candidates and make your vote count – voting opens later this week.

    @rob_farley

  • Superhero powers for managing the enterprise

    This month’s T-SQL Tuesday is hosted by Jen McCown of @midnightdba fame. She wants us to write about strategies for managing the enterprise, and as a database consultant, I find myself giving advice on this kind of thing to customers remarkably often.

    No, I’m not going to do stories about LobsterPot customers. We don’t do that. What happens between us and the customer stays between us and the customer. However, I can talk about the kinds of things that we look for when we talk to a customer.

    The thing that I want to look at in this post is that twitch you get when something doesn’t feel right – the ‘spider-sense’ feeling that Peter Parker gets when there’s something that’s not quite right.

    TSQL2sDay150x150

    Experience is what helps people know what ‘normal’ looks like. I’ve heard stories that people who are trained to spot counterfeit bank notes don’t spend time with fake notes; they spend time with the real ones (on loan, presumably). They learn what ‘normal’ looks like, and get really familiar with it. That way, when something isn’t quite right, they can spot it and raise an alarm.

    DBAs taking care of an environment, whether small or large, will have learned what ‘normal’ looks like. They should have benchmarks that tell them how various metrics normally behave, and for those things that they don’t have formal metrics on, they should still be familiar enough to recognise when something isn’t right, even if they can’t tell exactly what. Their ‘spider-sense’ should tingle.

    If you don’t have benchmarks, then get them. Every time you find something that doesn’t seem right, you will wish you had a benchmark on that thing to be able to quantify your suspicion. Feeling like something isn’t right is great, but it won’t be long until someone asks “How do you know this thing needs attention?”, and “I just know” probably won’t cut it. If you’re a consultant, you can probably get away with “In my experience...”, because that’s what they’re paying you for, but having supporting evidence – actual numbers – helps, particularly if you’re the main person responsible and need to persuade someone to find the money for a capital expenditure.
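
    A hedged sketch of one way to start collecting those numbers – snapshotting a couple of counters on a schedule into a table whose name I’ve simply made up here:

    -- A sketch only (dbo.PerfBaseline is a made-up name): capture a couple of counters
    select getdate() as capture_time,
           rtrim(object_name) as object_name,
           rtrim(counter_name) as counter_name,
           rtrim(instance_name) as instance_name,
           cntr_value
    into dbo.PerfBaseline        -- first run creates the table; later runs would INSERT instead
    from sys.dm_os_performance_counters
    where counter_name in ('Page life expectancy', 'Batch Requests/sec');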

    Having the numbers handy is useful in a lot of situations, but there’s also a whole bunch of tools available to look at. A while back I wrote about how DBAs could use the same kinds of tools that other industries hire data professionals to provide, in a post called “DBAs and the Internet of Things”. If you take this kind of approach and start analysing the readings from all kinds of things that affect your database, then you can get ahead of the game. Feed this stuff into something like Azure ML for predictive analytics, and you might be able to have an even better spider-sense, where the prediction isn’t just based on your own opinions, but on what has caused failures in the past.

    Too often, the significant thing is some small detail that most people wouldn’t notice, but because of your experience and expertise, you can spot it and work out whether it’s significant or not. Then, if you don’t have that particular thing benchmarked, or analysed by other tools, you can include it to see what’s going on.

    ...and develop superhero powers for managing your enterprise. It’s something we regularly recommend to DBAs.

    @rob_farley

  • SQL 2014 queries under a strobe light

    I just wrote a post about Live Query Statistics. Let me show you how this technology in SQL Server 2014 can be used for some amazingly cool (nerdy cool, at least) stuff.

    Behind the scenes, LQS uses a DMV called sys.dm_exec_query_profiles. When you run a query with SET STATISTICS PROFILE ON, the engine puts data into this DMV for every operator in the plan that produces data (which doesn’t include the SELECT operator, or Compute Scalar, for example). What SSMS does while you’re watching a long-running query is poll this DMV over and over to get the progress data, so that you can see it in those moments between kicking off the query and its completion.
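
    The polling itself is nothing special. A sketch of the kind of query involved might look like this (the session_id here is just a placeholder for whichever session is running the query you care about):

    -- A sketch of the kind of polling SSMS does against the DMV
    select node_id, physical_operator_name, row_count, estimate_row_count
    from sys.dm_exec_query_profiles
    where session_id = 53;   -- placeholder session id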

    When you use LQS on a relatively quick query, say, one that completes in a single second, you can’t exactly watch this information come through. You might be able to use a debugger, and pause operation of your server for a moment while you step through it, but this is far from ideal. And yet a query that completes in a single second might actually need some work. What if this query needs to be able to run many times per second, and you’re looking for strategies to tune every last bit out of it?

    Clearly LQS is going to be no use.

    But the workings behind it... that’s another matter. Let me show you.

    When entomologists want to study what happens in the flight of an insect, they take lots of photographs, often using a strobe light to capture a moment with as little blur as possible. It allows for incredibly detailed images, like the ones you can see in this article from the Daily Mail in the UK.

    (Image only linked from source – please let me know if it’s not there, but also go and look at the amazing pictures that this guy takes)

    I don’t know how many times this insect flapped its wings between each image that was captured, but I think you’ll agree that with enough images, it would be possible to learn a lot about the way that the wing-flapping takes place.

    This is the same as what I’m doing with queries in SQL Server 2014.

    Suppose I run a query over and over, with statistics profile turned on. The code here runs my sub-second query over and over for six minutes.

    set statistics profile on;   -- needed so sys.dm_exec_query_profiles is populated for this session

    declare @t datetime = dateadd(minute, 6, getdate());

    -- run the sub-second query over and over for six minutes
    while (@t > getdate())
    begin

    select p.Name, sum(d.OrderQty) as Qty
    from Sales.SalesOrderDetail d
    join Production.Product p
    on p.ProductID = d.ProductID
    group by p.Name;

    end

    I used an SSMS window for this, and told SSMS to discard the results – I’d only needed to run it once to see what it returns, and here I just care about the profile stats. Incidentally, it returns 266 rows.

    Now, I know that the plan that’s being used here is:

    image

    The Node_IDs of these operators are 1 (the Join), 2 (the Aggregate), 3 (the Scan on SalesOrderDetail), and 7 (the Scan on Product). These numbers have gaps just because of the inner workings of the Query Optimizer. And they go from left to right because that’s how a plan runs. The SELECT calls Node 1, which calls Nodes 2 and 7, and Node 2 calls Node 3.

    So during five of the six minutes that my query was running over and over and over, I went into a different window and polled the DMV every half a second.

    set nocount on;
    go

    declare @t datetime = dateadd(minute, 5, getdate());

    -- capture a snapshot of the DMV every half-second for five minutes
    while (@t > getdate())
    begin

    insert dbo.rf_strobe
    select getdate() as myquerytime, *, 'Hash Match'
    from sys.dm_exec_query_profiles;

    waitfor delay '0:00:00.5';

    end

    I had made a table which matched the DMV, with a couple of extra columns thrown in: one to record when I captured the moment (so that I could differentiate between moments), and one (called ‘comment’) to record which query I was monitoring. The DMV contains the sql_handle and plan_handle, so I could’ve used those to tell the runs apart later, but I wanted something easier to work with than that.
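
    If you want to recreate it, something like this would do the job – a sketch only, since the original table definition isn’t shown here, but the column order needs to match the insert above:

    -- A sketch: create an empty table shaped like the DMV, plus the two extra columns
    select getdate() as myquerytime,
           p.*,
           cast('Hash Match' as varchar(100)) as comment
    into dbo.rf_strobe
    from sys.dm_exec_query_profiles p
    where 1 = 0;    -- copies no rows; just creates the structure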

    This data gave me 597 different values for ‘myquerytime’, 597 different moments captured. I don’t know how many different times my query ran in that time – probably far more than 597, although I wouldn’t have really cared if it were less. These 597 moments each had up to 4 rows, showing how each operator was going in its cycle.

    There are columns in the DMV for all kinds of attributes, but the one that I was most interested in was the row_count, which tells me how many rows the operator has produced. There are columns about CPU, but as I’m looking at a query which runs in a small number of milliseconds, I’m more interested in the number of rows it’s produced.

    Here’s some of the data that’s in my table.

    image

    You can see that when I polled the DMV at 19:28:20.303, Node 3 (the Clustered Index Scan on SalesOrderDetail) had pushed 25241 rows, and no rows had been outputted by the other three. Half a second later, the snapshot showed 91044 rows from that node. Another half a second, and it was 54279, and in the poll at 19:28:21.807, the scan had served up all its rows, and there was output from the other three operators.

    Each of these is like a strobe image, capturing a moment in the life of the query.

    To recreate how the query runs, we need to piece them together. To do this, I’m going to assume that the rows are produced in the same order each time (which is a reasonable assumption when I’m running the query over and over in quick succession, with no changing parameters or conditions). By summing the row_count across all the operators in each poll, I can order the polls. A quick bit of pivoting...

    select row_number() over (order by OverallRowCount) as rownum, *
    from
    (
    select    
        count(*) + sum(row_count) as OverallRowCount,
        max(case when node_id = 1 then row_count end) as HashMatchJoin,
        max(case when node_id = 2 then row_count end) as HashMatchAgg,
        max(case when node_id = 3 then row_count end) as CIXScan,
        max(case when node_id = 7 then row_count end) as IXScan
    from dbo.rf_strobe
    where comment = 'Hash Match'
    group by myquerytime
    ) s
    order by rownum
    ;

    ...and I can see the query start:

    image

    ..and the moment when the Clustered Index Scan stops outputting more rows:

    image

    ...and the end of the query.

    image

    Notice that for almost all of the time we were polling the DMV to see how the query was going, data was still being pulled out of SalesOrderDetail. It was only 96.1% (574/597) of the way into the query that data started to be outputted from the blocking Hash Aggregate. And because a Hash Match Join blocks until the hash table has been completed, we have to wait even longer before we eventually start pulling data from the Product table, where we pull up to 504 rows to find the 266 rows of interest.

    I’m sure you can get a feel for how this query is running from this information.

    But let’s compare this to what happens if I force a Nested Loop join instead of the Hash Match Join. This plan:

    image

    I repeated the collection of data, running the same query over and over but with OPTION (LOOP JOIN), and a different comment in the polled data. I also decided to use dynamic SQL to query my strobe table to save rewriting the pivot for each plan.

    declare @comment varchar(100) = 'Forced Loop Join';

    -- build the pivot dynamically: one max(case...) column per node_id captured for this comment
    declare @qry nvarchar(max) =
    'select row_number() over (order by OverallRowCount) as rownum, *
    from
    (
    select
        count(*) + sum(row_count) as OverallRowCount
    ' + (select ', max(case when node_id = ' + cast(node_id as varchar(10)) + '
                        then row_count end) as [' +
                        cast(node_id as varchar(10)) + ': ' + physical_operator_name + ']'
            from dbo.rf_strobe
            where comment = @comment
            group by node_id, physical_operator_name
            order by node_id
            for xml path(''),type).value('.','nvarchar(max)')
    + '
    from dbo.rf_strobe
    where comment = @comment
    group by myquerytime
    ) s
    '
    ;
    exec sp_executesql @qry, N'@comment varchar(100)', @comment = @comment;

    It started very similarly, but was very different towards the end.

    The Scan seemed to run for a longer portion of the query – 98.5% (587/596) – but as the Hash Match Aggregate started producing rows, the Nested Loop was pulling each row from the Seek and returning it to the SELECT operator before pulling the next row in. You can see the row_count going up equally across the three operators, which is very different to what we saw with the Hash Match Join.

    image

    A factor I hadn’t considered before became evident at the end. In the Hash Match Join example, we saw a bunch of moments when all the operators had finished producing their data, which aren’t there in the Nested Loop example. The result of the query is no different, but the tear-down time is much quicker with the Nested Loop – presumably because the hash table used for the join doesn’t need to be dropped. I didn’t expect this to be as significant as it seems to be, but we certainly caught six images – about 1% of them – where the operators had all stopped returning data yet the query was still hanging around as far as the DMV was concerned. With everything else being identical, I can only assume it’s down to the hash table having more to do to shut down than the Nested Loop.

    Just for fun, I tested the pivot query itself, giving 582 strobe images.

    image

    image

    The first thing to notice is that the Compute Scalar operators didn’t report anything, as expected.

    image

    The next is that the right-most Sort was blocking, and had a significant pause after the Scan finished – about 35 strobe images, or 6% of the query time.

    image

    The Stream Aggregate doesn’t block, but the left-most Sort – which calls the Stream Aggregate (via two Compute Scalars) – does. After that there’s another pause (smaller this time – fewer strobe images), after which the Segment and Sequence Project operators don’t block.

    image

    At the end of the query we have about 80 strobe images – well over 10% of the query time – after the Sequence Project has outputted its last row.

    image

    image

    Now – there is still more for me to learn about this, and I have been making some assumptions about whether the DMV polling gives a sufficiently random sample of moments. But from what I’ve seen, there is definitely information in here about queries that I haven’t seen before, and which requires some further research.

    Finally, I spent a bit of time looking at visualisation options for this. I immediately thought of the Power View play axis that’s on scatter charts, but sadly I didn’t have much luck coming up with an effective visualisation quickly. I felt like Hans Rosling with his scatter chart about birth rates, and used LAG and MAX() OVER() to come up with a speed of row production compared to the total – the kind of calculation sketched below – but I kept getting the “representative sample” message, which didn’t help. I’m sure it won’t be long before this visualisation is easy, and for the purposes of analysis, I was more interested in exploring the data than making it look pretty.
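
    This is the kind of LAG / MAX() OVER() calculation I mean, run over the captured strobe data – a sketch only, using the node_id and comment values from earlier, and not the exact query I was feeding into Power View:

    -- A sketch: per-poll row production for the Clustered Index Scan (node_id 3),
    -- and how far through its total output it is at each poll
    with pivoted as
    (
        select row_number() over (order by count(*) + sum(row_count)) as rownum,
               max(case when node_id = 3 then row_count end) as CIXScan
        from dbo.rf_strobe
        where comment = 'Hash Match'
        group by myquerytime
    )
    select rownum,
           CIXScan - lag(CIXScan, 1, 0) over (order by rownum) as rows_since_last_poll,
           1.0 * CIXScan / max(CIXScan) over () as proportion_of_total
    from pivoted
    order by rownum;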

    Strobe photography is very impressive. I just never thought it would apply to T-SQL queries.

    @rob_farley

  • Live Query Statistics in SQL 2014, not just SQL 2016

    One of the coolest things in SQL Server 2016 is Live Query Statistics (LQS). But did you know that it’s available in SQL 2014? (Edit: Needs SP1)

    The thing is that we haven’t been able to view it effectively until now, with SQL Server Management Studio 2016 becoming available.

    LQS provides the ability to watch an execution plan while the query is still running.

    image

    In this image above, you can see that I have a fairly long-running query, and I’ve taken a screen shot while the query is running. I can tell it’s still running because of the dotted lines. At this point, the Hash Match has asked the Index Scan for all the rows it’s going to, and the Index Scan has been marked as 100% done. The Hash Match is now asking the Merge Join for rows. The Merge Join is pulling data out of the Sort, which has already pulled in the rows it needed, and so on.

    It’s cool stuff.

    And significantly, this is against SQL Server 2014 (SP1). I had to use SSMS 2016, but then the button appeared at the top to include Live Query Statistics...

    image

    ...and then I could see them. When I connected to an earlier version, such as SQL Server 2012, the box was disabled and I couldn’t see them.

    So why not install SSMS 2016 (it’s only the client tool – I’m not suggesting you put it on a server), and see how some of your queries look?
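
    If you need something long-running enough to watch, any deliberately heavy query will do. This one is just an illustration against AdventureWorks – it’s not the query from the screenshot above:

    -- Just an illustration: a deliberately slow query to watch with LQS
    select count(*)
    from Sales.SalesOrderDetail d1
    join Sales.SalesOrderDetail d2
    on d1.ProductID = d2.ProductID;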

    You shouldn’t do this against a production machine, because it takes some effort for SQL Server to produce the data used for this. But see how you go. It’s definitely a useful performance-tuning feature which is available right now.

    @rob_farley

  • Can you still use DBCC PAGE to access data in dropped tables if TDE is turned on?

    Yes.

    #ShortestPostEver

    @rob_farley 

     

    Oh, ok... I’ll explain some more.

    Transparent Data Encryption encrypts data at rest. That’s the stuff that’s on disk – the encryption happens when the data is written to disk, and the decryption happens as the data is loaded into RAM from the disk. The engine handles this so that it’s invisible to the user, applications, and so on. Without it, you can open an MDF/NDF file in a hex editor and read the contents. With it, you can’t.
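
    For context, turning TDE on looks something like this – a minimal sketch, where every name and password is a placeholder rather than anything from my setup:

    -- A minimal sketch of enabling TDE (all names and passwords are placeholders)
    use master;
    create master key encryption by password = 'StrongPasswordHere!1';
    create certificate TdeDemoCert with subject = 'TDE demo certificate';

    use MyDatabase;
    create database encryption key
        with algorithm = AES_256
        encryption by server certificate TdeDemoCert;
    alter database MyDatabase set encryption on;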

    Here’s an example with a database that’s not encrypted:

    image

    And here’s an example that is:

    image

    I searched for some of the text that I could see – successfully in the first, unsuccessfully in the second.

    I also used SQL Server to show me the contents of a page using DBCC PAGE, and could do this successfully (once I’d closed the files in the hex editor and brought the databases back online).
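
    For reference, the kind of DBCC PAGE call involved looks something like this – a sketch, where the database name, file id and page id are placeholders rather than the ones I actually used:

    -- A sketch of the DBCC PAGE call (placeholders for database, file id, page id)
    dbcc traceon(3604);                      -- send DBCC PAGE output to the client
    dbcc page('MyEncryptedDb', 1, 256, 3);   -- database, file id, page id, output level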

    image

    ...which also worked in both databases.

    image

    I had hoped this would work okay, because I figured that DBCC PAGE would have to pull the data into RAM again (remember the database had been offline – the pages weren’t in RAM before), and that it would decrypt the data as it did so. But I wondered if DBCC PAGE might work at a slightly lower level and bypass the decryption somehow. I argued with myself that if TDE was indeed transparent, it shouldn’t care... and what if my application relied on using DBCC PAGE? It’s a known feature, even if it is officially undocumented (which is where my doubts set in).

    But as you see, it worked okay.

    But what if I dropped the table first? Would SQL then go “Hang on – this page isn’t one that I have control over any more...” and refuse to decrypt it?

    No – it works just the same.

    Even if you drop an object, you can still access the pages that it used until they get overwritten. You won’t be able to read them with a hex editor, but DBCC PAGE will still read them in decrypted form, letting you pull that data out.

    And yes, you can even use DBCC WRITEPAGE to overwrite the bytes in their unencrypted form, so that you can use that (dangerous) method of fixing corruption, even in an encrypted database. I’ve just redone my fix for Steve Stedman’s 10th Corruption Challenge, and it worked just fine on an encrypted version of the database.

    It’s still T-SQL Tuesday on the topic of encryption, so I’m throwing this one into the mix for that as well.

    TSQL2sDay150x150
