
Rob Farley

- Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server) and leader of the SQL User Group in Adelaide, Australia. Rob is also a Director of PASS, and runs training courses around the world in SQL Server and BI topics.

  • Plan Operator Tuesday round-up

    Eighteen posts for T-SQL Tuesday #43 this month, discussing Plan Operators.

    I put them together and made the following clickable plan. It’s 1000px wide, so I hope you have a monitor wide enough.

    Let me explain this plan for you (people’s names are the links to the articles on their blogs – the same links as in the plan above).

    It was clearly a SELECT statement. Wayne Sheffield (@dbawayne) wrote about that, so we start with a SELECT physical operator, leveraging the logical operator Wayne Sheffield.

    The SELECT operator calls the Paul White operator, discussed by Jason Brimhall (@sqlrnnr) in his post. The Paul White operator is quite remarkable, and can consume three streams of data. Let’s look at those streams.

    The first pulls data from a Table Scan – Boris Hristov (@borishristov)’s post – using parallel threads (Bradley Ball – @sqlballs) that pull the data eagerly through a Table Spool (Oliver Asmus – @oliverasmus). A scalar operation is also performed on it, thanks to Jeffrey Verheul (@devjef)’s Compute Scalar operator.

    The second stream of data applies Evil (I figured that must mean a procedural TVF, but could’ve been anything), courtesy of Jason Strate (@stratesql). It performs this Evil on the merging of parallel streams (Steve Jones – @way0utwest), which suck data out of a Switch (Paul White – @sql_kiwi). This Switch operator is consuming data from up to four lookups, thanks to Kalen Delaney (@sqlqueen), Rick Krueger (@dataogre), Mickey Stuewe (@sqlmickey) and Kathi Kellenberger (@auntkathi). Unfortunately Kathi’s name is a bit long and has been truncated, just like in real plans.

    The last stream performs a join of two others via a Nested Loop (Matan Yungman – @matanyungman). One pulls data from a Spool (my post – @rob_farley) populated from a Table Scan (Jon Morisi). The other applies a catchall operator (the catchall is because Tamera Clark (@tameraclark) didn’t specify any particular operator, and a catchall is what gets shown when SSMS doesn’t know what to show. Surprisingly, it’s showing the yellow one, which is about cursors. Hopefully that’s not what Tamera planned, but anyway...) to the output from an Index Seek operator (Sebastian Meine – @sqlity).

    Lastly, I think everyone put in 110% effort, so that’s what all the operators cost. That didn’t leave anything for me, unfortunately, but that’s okay. Also, because he decided to use the Paul White operator, Jason Brimhall gets 0%, and his 110% was given to Paul’s Switch operator post.

    I hope you’ve enjoyed this T-SQL Tuesday, and have learned something extra about Plan Operators. Keep your eye out for next month’s one by watching the Twitter Hashtag #tsql2sday, and why not contribute a post to the party? Big thanks to Adam Machanic as usual for starting all this.

    @rob_farley

  • Spooling in SQL execution plans

    Sewing has never been my thing. I barely even know the terminology, and when discussing this with American friends, I even found out that half the words that Americans use are different to the words that English and Australian people use.

    That said – let’s talk about spools! In particular, the Spool operators that you find in some SQL execution plans. This post is for T-SQL Tuesday, hosted this month by me! I’ve chosen to write about spools because they seem to get a bad rap (even in my song I used the line “There’s spooling from a CTE, they’ve got recursion needlessly”). I figured it was worth covering some of what spools are about, and hopefully explain why they are remarkably necessary, and generally very useful.

    If you have a look at the Books Online page about Plan Operators, at http://msdn.microsoft.com/en-us/library/ms191158.aspx, and do a search for the word ‘spool’, you’ll notice it says there are 46 matches. 46! Yeah, that’s what I thought too...

    image

    Spooling is mentioned in several operators: Eager Spool, Lazy Spool, Index Spool (sometimes called a Nonclustered Index Spool), Row Count Spool, Spool, Table Spool, and Window Spool (oh, and Cache, which is a special kind of spool for a single row, but as it isn’t used in SQL 2012, I won’t describe it any further here).

    Spool, Table Spool, Index Spool, Window Spool and Row Count Spool are all physical operators, whereas Eager Spool and Lazy Spool are logical operators, describing the way that the other spools work. For example, you might see a Table Spool which is either Eager or Lazy. A Window Spool can actually act as both, as I’ll mention in a moment.

    In sewing, cotton is put onto a spool to make it more useful. You might buy it in bulk on a cone, but if you’re going to be using a sewing machine, then you quite probably want to have it on a spool or bobbin, which allows it to be used in a more effective way. This is the picture that I want you to think about in relation to your data.

    I’m sure you use spools every time you use your sewing machine. I know I do. I can’t think of a time when I’ve got out my sewing machine to do some sewing and haven’t used a spool. However, I often run SQL queries that don’t use spools. You see, the data that is consumed by my query is typically in a useful state without a spool. It’s like I can just sew with my cotton despite it not being on a spool!

    image

    Many of my favourite features in T-SQL do like to use spools though.

    image

    This looks like a very similar query to before, but includes an OVER clause to return a column telling me the number of rows in my data set. I’ll describe what’s going on in a few paragraphs’ time.
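
    The screenshots aren’t reproduced here, but a minimal sketch of the kind of queries being described (the tables are my assumption, using AdventureWorks) might look like this. The plain join typically needs no spool; adding the COUNT(*) OVER () column is where a Table Spool tends to appear.

    -- plain join: typically no spool needed
    SELECT p.Name, s.Name AS Subcategory
    FROM Production.Product AS p
    JOIN Production.ProductSubcategory AS s ON s.ProductSubcategoryID = p.ProductSubcategoryID;

    -- same join plus a count over the whole set: this is where a Table Spool tends to show up
    SELECT p.Name, s.Name AS Subcategory,
        COUNT(*) OVER () AS NumRows
    FROM Production.Product AS p
    JOIN Production.ProductSubcategory AS s ON s.ProductSubcategoryID = p.ProductSubcategoryID;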

    So what does a Spool operator actually do?

    The spool operator consumes a set of data, and stores it in a temporary structure, in the tempdb database. This structure is typically either a Table (ie, a heap), or an Index (ie, a b-tree). If no data is actually needed from it, then it could also be a Row Count spool, which only stores the number of rows that the spool operator consumes. A Window Spool is another option if the data being consumed is tightly linked to windows of data, such as when the ROWS/RANGE clause of the OVER clause is being used. You could maybe think about the type of spool being like whether the cotton is going onto a small bobbin to fit in the base of the sewing machine, or whether it’s a larger spool for the top.

    A Table or Index Spool is either Eager or Lazy in nature. Eager and Lazy are Logical operators, which talk more about the behaviour, rather than the physical operation. If I’m sewing, I can either be all enthusiastic and get all my cotton onto the spool before I start, or I can do it as I need it. “Lazy” might not be the best word to describe a person – in the SQL world it describes the idea of either fetching all the rows to build up the whole spool when the operator is called (Eager), or populating the spool only as it’s needed (Lazy).

    Window Spools are both physical and logical. They’re eager on a per-window basis, but lazy between windows.
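
    For illustration (my own example, not one from the original post), a ROWS-framed window aggregate like this is the kind of query where you would typically find a Window Spool in the plan:

    -- a running aggregate over a small ROWS frame; a Window Spool typically appears in its plan
    SELECT SalesOrderID, OrderDate, TotalDue,
        SUM(TotalDue) OVER (ORDER BY SalesOrderID
                            ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS RecentTotal
    FROM Sales.SalesOrderHeader;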

    And when is it needed?

    The way I see it, spools are needed for two reasons.

    1 – When data is going to be needed AGAIN.

    2 – When data needs to be kept away from the original source.

    If you’re someone that writes long stored procedures, you are probably quite aware of the second scenario. I see plenty of stored procedures being written this way – where the query writer populates a temporary table, so that they can make updates to it without risking the original table. SQL does this too. Imagine I’m updating my contact list, and some of my changes move data to later in the book. If I’m not careful, I might update the same row a second time (or even enter an infinite loop, updating it over and over). A spool can make sure that I don’t, by using a copy of the data. This problem is known as the Halloween Effect (not because it’s spooky, but because it was discovered in late October one year). As I’m sure you can imagine, the kind of spool you’d need to protect against the Halloween Effect would be eager, because if you’re only handling one row at a time, then you’re not providing the protection... An eager spool will block the flow of data, waiting until it has fetched all the data before serving it up to the operator that called it.

    In the query below I’m forcing the Query Optimizer to use an index which would be upset if the Name column values got changed, and we see that before any data is fetched, a spool is created to load the data into. This doesn’t stop the index being maintained, but it does mean that the index is protected from the changes that are being done.
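
    The query itself is in the screenshot, but a sketch of that kind of statement might look like the following (the table and the AK_Product_Name index hint are my assumption, borrowed from AdventureWorks). Forcing the index on the very column being updated is what typically brings in the Eager Spool for Halloween protection.

    BEGIN TRAN;
    -- update the Name column while forcing the index on Name to be used
    UPDATE p
    SET Name = Name + N'!'
    FROM Production.Product AS p WITH (INDEX (AK_Product_Name));
    ROLLBACK;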

    image

    There are plenty of times, though, when you need data repeatedly. Consider the query I put above. A simple join, but then counting the number of rows that came through. The way that this has executed (be it ideal or not), is to ask that a Table Spool be populated. That’s the Table Spool operator on the top row. That spool can produce the same set of rows repeatedly. This is the behaviour that we see in the bottom half of the plan, where a join is being done between the rows that are being sourced from the spool – one being aggregated and one not – producing the columns that we need for the query.

    Table v Index

    When considering whether to use a Table Spool or an Index Spool, the question that the Query Optimizer needs to answer is whether there is sufficient benefit to storing the data in a b-tree. The idea of having data in indexes is great, but of course there is a cost to maintaining them. Here we’re creating a temporary structure for data, and there is a cost associated with populating each row into its correct position according to a b-tree, as opposed to simply adding it to the end of the list of rows in a heap. Using a b-tree could even result in page-splits as the b-tree is populated, so there had better be a reason to use that kind of structure. That all depends on how the data is going to be used in other parts of the plan. If you’ve ever thought that you could use a temporary index for a particular query, well this is it – and the Query Optimizer can do that if it thinks it’s worthwhile.

    It’s worth noting that even when a Spool is populated using an Index Spool, it can still be fetched using a Table Spool. Whether a Spool used as a source shows as a Table Spool or an Index Spool depends more on whether a Seek predicate is used than on the underlying structure.

    Recursive CTE

    I’ve already shown you an example of spooling when the OVER clause is used. You might see them being used whenever you have data that is needed multiple times, and CTEs are quite common here.

    With the definition of a set of data described in a CTE, if the query writer is leveraging this by referring to the CTE multiple times, and there’s no simplification to be leveraged, a spool could theoretically be used to avoid reapplying the CTE’s logic. Annoyingly, this doesn’t happen. Consider this query, which really looks like it’s using the same data twice. I’m creating a set of data (which is completely deterministic, by the way), and then joining it back to itself. There seems to be no reason why it shouldn’t use a spool for the set described by the CTE, but it doesn’t.
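
    Here is a contrived sketch of the shape I mean (my own example, not the original query): a deterministic CTE referenced twice and joined back to itself – the kind of query where you might hope for a spool but typically won’t get one.

    WITH OrderCounts AS
    (
        SELECT OrderDate, COUNT(*) AS NumOrders
        FROM Sales.SalesOrderHeader
        GROUP BY OrderDate
    )
    -- the CTE is referenced twice, joined back to itself on the previous day
    SELECT a.OrderDate, a.NumOrders, b.NumOrders AS PrevDayOrders
    FROM OrderCounts AS a
    JOIN OrderCounts AS b ON b.OrderDate = DATEADD(day, -1, a.OrderDate);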

    image

    On the other hand, if we don’t pull as many columns back, we might see a very different plan.

    image

    You see, CTEs, like all sub-queries, are simplified out to figure out the best way of executing the whole query. My example is somewhat contrived, and although there are plenty of cases when it’s nice to give the Query Optimizer hints about how to execute queries, it usually doesn’t do a bad job, even without spooling (and you can always use a temporary table).

    When recursion is used, though, spooling should be expected.

    Consider what we’re asking for in a recursive CTE. We’re telling the system to construct a set of data using an initial query, and then use that set as a source for another query, piping this back into the same set and back around. It’s very much a spool. The analogy of cotton is long gone here, as the idea of having a continual loop of cotton feeding onto a spool and off again doesn’t quite fit, but that’s what we have here. Data is being fed onto the spool, and getting pulled out a second time when the spool is used as a source.

    image

    (This query is running on AdventureWorks, which has a ManagerID column in HumanResources.Employee, not AdventureWorks2012)
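
    Again, the actual query is in the screenshot above; a sketch of the kind of recursive CTE being described (my reconstruction, using that ManagerID column) might look like this:

    WITH Org AS
    (
        -- anchor: the person with no manager
        SELECT EmployeeID, ManagerID, 0 AS Lvl
        FROM HumanResources.Employee
        WHERE ManagerID IS NULL
        UNION ALL
        -- recursive part: everyone who reports to someone already in the set
        SELECT e.EmployeeID, e.ManagerID, o.Lvl + 1
        FROM HumanResources.Employee AS e
        JOIN Org AS o ON o.EmployeeID = e.ManagerID
    )
    SELECT EmployeeID, ManagerID, Lvl
    FROM Org;
    -- try adding OPTION (MAXRECURSION 0) to watch the Assert operator disappear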

    The Index Spool operator is sucking rows into it – lazily. It has to be lazy, because at the start, there’s only one row to be had. However, as rows get populated onto the spool, the Table Spool operator on the right can return rows when asked, ending up with more rows (potentially) getting back onto the spool, ready for the next round. (The Assert operator is merely checking to see if we’ve reached the MAXRECURSION point – it vanishes if you use OPTION (MAXRECURSION 0), which you can try yourself if you like).

    Spools are useful. Don’t lose sight of that. Every time you use temporary tables or table variables in a stored procedure, you’re essentially doing the same – don’t get upset at the Query Optimizer for doing so, even if you think the spool looks like an expensive part of the query.

    I hope you’re enjoying this T-SQL Tuesday. Why not head over to my post that is hosting it this month to read about some other plan operators? At some point I’ll write a summary post – once I have, you should find a comment below pointing at it.

    @rob_farley

  • Running goals - an update

    Back in January, I wrote about some of my “running goals”. It’s time to update those of you who aren’t connected to me on Twitter or Facebook (or weren’t looking on the right days).

    I mentioned that I wanted to get a better time in a half marathon.

    Yes. I did that. I ran two half marathons in Adelaide this year, with times of 2:04:58 and 2:03:57.

    [screenshots of the two half marathon results]

    I mentioned that I wouldn’t mind trying a full marathon.

    Last week, I did that. It was in the Barossa Valley – the wine region just north of Adelaide. My official time was 5:18:24. I probably could’ve been faster, but I’d had injuries along the way during my training. The South Australian Road Runners’ Club had a mentor program for people interested in doing a marathon, and I got involved. I didn’t make it to many (none) of the Sunday morning training runs, but I lined up anyway, and ran a marathon.

    The date of the marathon was significant for me – it was the day before the tenth anniversary of my back injury. In hindsight, I would never recommend running a marathon the day before a significant day like that. I was pleased I’d finished the marathon (I wasn’t aiming for a particular time, and was just pleased to finish – being hit by a car around 31km in didn’t really help my time, but I managed to get around), and I hadn’t really expected the impact of the injury-anniversary to affect me so much. I got physically and emotionally sick for a few days, and it was horrible. Ten years is a long time – more than a quarter of my life – and I know that it’s been tough not just on me but on those around me. Completing a marathon hasn’t made me feel like I’ve conquered my injury, it just helps me feel like I know I can keep going despite it.

    [race photo] << I’m smiling because someone’s pointing a camera at me. And because I can see the finish line.

    I mentioned I wanted to keep doing some cardio every day, and lose more weight.

    This fell off in mid-Feb when I got the first injury of my marathon training. I picked up a thing called “hip bursitis”. That led to increased back pain, and doing something every day was just beyond me. I got below 200lb, but only just. I’m roughly that now still, and I’m okay about that. I might try an “every day” thing again soon, but I’ll see how I go.

    I wanted to run some distance during 2013. 750 miles? 900? 1000?

    Well, five months in, I’ve run 341. That tracks to a bit over 800. But also I’ve run a marathon. And right now, a week later, part of me thinks I’ve achieved enough running goals, and it would be good to NOT run for a while. So forgive me if I don’t manage to run 1000 miles during 2013. I’d have to do over 20 miles every week from now to reach 1000 – my injuries just aren’t compatible with that.

    Running is tough. It’s not completely good for my back, and I have mixed emotions about it. As people have pointed out, I’m not exactly the right build for running... but that’s not the point. The point is that I have a back injury, and I need to work with my physio to continue to see it improve. Running might be part of that, but there are lots of things that I still don’t have in my life that I would like to be able to have again. I’d like to be able to dance. I’d like to be able to play sport with my sons. I’d like to be able to continue to pick up my daughter (who’s five, and getting towards the kind of weight that I can’t actually lift). One day, I plan to carry my wife over the threshold of the house we built. Any of those things is going to take a lot of work – but a year ago, I couldn’t run either.

    What, what? A car?

    Yes – I got hit by a car during the Barossa Marathon. It wasn’t like it took my legs out and I rolled across the windscreen. It’s more that it hit my hands.

    A marshal had stopped this car at a point where we had to cross the road we were running on. It was at the 31km point – almost 20 miles into the 26-mile route. The driver had wound down her window and was talking to the marshal. That was all fine. I took it fairly wide, and crossed in front of the car. The marshal was saying to her “Don’t pull off yet”, but she did – I guess she was still looking at the marshal as she took her foot off the clutch. I was in front of her at that point, and as she started forward, I wasn’t moving fast enough to jump out of the way. I turned and put my hands (and weight) on her car, and she stopped. I almost fell over, but didn’t.

    Annoyingly, I’d stopped running then. Later in the run, Erin Stellato (@erinstellato) tweeted to me that it’s easier to keep going if you never stop. Well, I had trouble getting going again. My legs were sore, of course. My back had just got a jolt I wasn’t expecting, as had my adrenalin. I was sweating and leaning forward (so my eyes were starting to suffer). It took me an embarrassingly long time to finish crossing the road. Thankfully there was a drink station there, where I washed my eyes out, and I kept going, about two minutes per mile slower than I’d been going beforehand. I’m not saying I would’ve broken five hours, but I would recommend if time is important to you that you don’t have an altercation with a vehicle part way round.

    Massive thanks to the people that sent the 187 tweets during my run (read to me by the RunMeter app). It helped. Now go and give money to Yanni Robel’s fundraising. I run in a Team In Training shirt to honour the effort that she’s putting in, so please – give money to her fund. Also, if you’re reading this in the first 8-ish hours after I’ve published it, send Yanni a message on Twitter (@yannirobel), because she’s about to run a marathon too!

  • T-SQL Tuesday #43 – Hello, Operator?

    June 11th is next week. It’s a Tuesday, the second Tuesday of the month, making it T-SQL Tuesday! This is your opportunity to write on a specific topic, along with countless* people from around the SQL community (* at least until the end of the day, when it will be quite easy to count how many people joined in). This month marks 3.5 years since it started, with 42 events held so far.

    This month, for number 43, I’m the host. That means that I set the topic, and this blog post is the one that you need to get a comment into so that I can find your post before I collate them.

    The topic is Plan Operators. If you ever write T-SQL, you will almost certainly have looked at execution plans (if you haven’t, go look at some now. I mean really – you should be looking at this stuff). As you look at these things, you will almost certainly have had your interest piqued by some, and tried to figure out a bit more about what’s going on.

    That’s what I want you to write about! One (or more) plan operators that you looked into. It could be a particular aspect of a plan operator, or you could do a deep dive and tell us everything you know. You could relate a tuning story if you want, or it could be completely academic. Don’t just quote Books Online at me, explain what the operator means to you. You could explore the Compute Scalar operator, or the many-to-many feature of a Merge Join. The Sequence Project, or the Lazy Spool. You’re bound to have researched one of them at some point (if you never have, take the opportunity this week), and have some wisdom to impart. This is a chance to raise the collective understanding about execution plans!

    [operator icons: Compute Scalar, Merge Join, Sequence Project, Lazy Spool]

    So, the T-SQL Tuesday thing...

    If you haven’t heard about T-SQL Tuesday before, then do this exercise with me. Do a quick search for “T-SQL Tuesday”. If you glance down the results,  you’ll see a bunch of posts either inviting people to a T-SQL Tuesday, contributing in one (the ones that contribute link back to the host), or summarising the posts. The ‘host post’ (which this month is this one!) will have a bunch of comments and trackbacks, pointing to all the contributing posts (and hopefully to the summary too). All-in-all, it makes a terrific resource about that particular subject.

    So here’s what you do!

    1. Some time next Tuesday (GMT) – that’s June 11th 2013 – publish your blog post. If you’re in Australia like me, GMT Tuesday runs from mid-morning on the Tuesday through to mid-morning on Wednesday. If you’re in the US, then it’ll run from Monday evening through to Tuesday evening.

    2. Make sure that your post includes, somewhere near the top, the T-SQL Tuesday image (from above), with a link to this post. This helps people find the other posts on the topic, and hopefully encourages others to jump in with their own posts. If it helps, just switch to source/HTML mode, and put the following snippet in there:
    <a href="http://sqlblog.com/blogs/rob_farley/archive/2013/06/02/t-sql-tuesday-43-hello-operator.aspx" target="_blank"><img alt="TSQL Tuesday" align="right" border="0" src="http://sqlblog.com/blogs/rob_farley/TSQL2sDay150x150_1B3B2D1E.jpg"/></a>

    3. Come back to this post, and write a comment, providing a link to your post. That’s probably the only way I’ll know it’s there. You can tweet, sure (use the hashtag #tsql2sday), but I still might not find it.

    4 (optional, but maybe worthwhile). Keep your eye out for other people’s posts, and for when I publish the summary...

    That’s about it – happy writing!

    Remember: 1: Jun 11 GMT; 2: Image with a link; 3: Comment below.

    @rob_farley

  • Part of the journey: failure

    The topic for this month’s T-SQL Tuesday is about the journey. Wendy Pastrick’s choice (I’m hosting again next month!).

    There are a lot of journeys. There are some that just keep going, and others that seem to finish (some in success; some in failure). Of course, many of the ones that finish end up being the start of new journeys, but sometimes they don’t need to continue – they just need closure. There are things that can be learned regardless of how things went, whether or not goals were reached, and whether or not there was failure.

    There have been a few things recently to remind me of this...

    I recently visited a company that has put a video together promoting the idea of failure. It wasn’t asking that people fail, but said “Go ahead and fail,” because failure happens. They had been through a rough time, but were persisting and seeing things turn around.

    Just the other night, we saw the musical Chitty Chitty Bang Bang (you probably know the movie – the Bond film where Dick van Dyke stars as the guy with the gadget-car, who takes on Goldfinger and falls for the girl with the inappropriate name). Anyway, there’s a brilliant song in that called “The Roses of Success” (YouTube link there). It has the same sentiment – “…from the ashes of disaster grow the roses of success!”

    A few years ago, my kids started saying “FAIL!” when someone did something wrong. I can’t say I liked the insult. Far worse would’ve been “DIDN’T TRY!” It would be very easy to just stay in bed and ‘avoid failure’ that way, but anyone who fails has at least done something. To fail, you must at least be active.

    I talk to a lot of people about Microsoft Certification, particularly people who have failed an exam. I tell everyone (not just those who have failed before) to try the exams before they feel they’re ready for them. What’s the worst that can happen? Worst case, they don’t pass. But how is that a bad thing? It might feel less than brilliant (I know, it’s happened to me before), but it gives an opportunity to target the weak areas before having a subsequent attempt. It doesn’t matter how many attempts it takes to get a passing score – the wrong option would be to give up. Studying can be excellent, but not to the point of causing extra stress.

    There are things in life we do easily, and there are things that we struggle with. I know there’s a bunch of stuff in my own life that falls into both categories. I don’t want this post to be a list of the things that I’m not doing well – I simply want to point out that I want to keep trying.

    With God’s help, I can improve in the areas in which I’m not excelling, and start to smell the roses of success.

  • Why I present

    Louis Davidson just asked me why I write, and now Bob Pusateri (@sqlbob) is asking me why I present, which is his question for this month’s T-SQL Tuesday.

    [T-SQL Tuesday logo]

    If you didn’t follow the link to see his actual question, you’ll need to know that he actually posed the question “How did you come to love presenting?”

    Well, sometimes I don’t, but on the whole, I have to admit that presenting is part of who I am, and I miss it if I’m not presenting. It’s why despite being a PASS board member (that link will only seem relevant if you’re reading this while I’m still one) and having plenty of reason to NOT present at the PASS Summit in 2013, I’ve submitted the maximum number of abstracts for consideration. It’s why I want to be teaching more, both online and in the classroom, and so on.

    It’s not that I think I have anything important to say (although I do only ever teach / present on things that I think are important).

    It’s not that I think I’m good at presenting (my feedback scores beg to differ).

    It’s not that I’m comfortable presenting (I still get ridiculously nervous most of the time).

    I’m just addicted to it.

    It’s a drug – it really is.

    I spend my time walking around the room, or around the stage, explaining things to people, watching for those moments when the audience gets it, and... well, I’m addicted to it.

    If you watch http://www.sqlbits.com/Sessions/Event5/Designing_for_simplification, you’ll see a few things. I was in Wales, and had started with the few words in Welsh that I know (but that’s been edited out – hopefully when I thought I was saying ‘hello’ I wasn’t actually insulting anyone). I nearly fell off the stage. I broke the microphone. I typed some things wrong in my queries. People complained that I didn’t say anything significant…

    But around 33:10 in, you hear the audience almost start clapping. IN THE UK (where people don’t clap for presentations). It’s a moment where people see something they weren’t expecting, and (hopefully) realise the potential in what they’ve heard.

    Phil Nolan wrote nicely about me on his blog, and said “Those of you who know Rob Farley will know he’s a funny guy with an enormous armoury of shockingly bad jokes.” More importantly though, he wrote “His design tips challenged a number of our ideas and meant I took away many valuable techniques,” which helped me know why I present.

    …because it’s not about me, it’s about you. I present because at least one of the people in the audience will benefit from it. And that’s addictive.

    @rob_farley

  • Filegroups and Non-Clustered Indexes

    Let’s start with some basics and then jump in a bit deeper, for this post to go with the 40th T-SQL Tuesday, hosted this month by Jen McCown.

    SQL Server holds data, and that data is stored physically in files.

    Of course, in the database world we think of the data as living in tables*, so naturally there must be some sort of mapping between the data in a table and the data in a file. This is where filegroups come in.

    When SQL objects are created within a database, there is an option to tell the system where to actually store them. But this isn’t a file, it’s a filegroup. (If it were a file and the disk volume that the file was on filled up, we’d have big problems. Instead, we put the object in a filegroup, and can add extra files – on different volumes if so desired – to that filegroup.) Objects are stored within filegroups. Filegroups are groups of files, although many databases do end up with only a single file per filegroup.

    Filegroups end up providing a large amount of flexibility for the storage of data. Rarely accessed data can be put in filegroups that have files on cheaper (but probably slower) disk, while data that is heavily written can be put on disk that’s more appropriate for that, and so on. I’m sure you get the picture, and this is nothing overly thought-worthy.

    You may even have already considered the idea around partitioning data across filegroups, moving data of a certain age (but potentially even from the same table) onto a different filegroup, so that queries that use different parts of tables can benefit from having some data on faster disk. Lots of potential. :)

    Where it becomes interesting though, is when you start to consider the concept of non-clustered indexes.

    You see, while a clustered index (or heap) provides the structure to store the data across all the columns of a table (although I’m simplifying it a little in regard to LOB data), we use copies of some of that data (in non-clustered indexes) to optimise access to the data.

    So where are these copies stored? Many database people don’t even think about this kind of thing, in the same way that they often don’t think about including indexes in their database design. I’m going to figure you’re not in that boat though, because I probably know you, and you’re probably not one of those people.

    Most people like to have a default filegroup that is NOT the primary filegroup. It means that when they create new objects, those new objects get put in the default filegroup, not the primary one. But it’s not actually that simple.

    Let’s start by creating a database.

    CREATE DATABASE fg_testing;
    GO
    USE fg_testing;
    GO
    SELECT *
    FROM sys.filegroups;
    --Only one filegroup at the moment

    image

    Notice the column data_space_id. This is the column which identifies each filegroup. We’ll use it later.

    Let’s create a new filegroup and set it to be the default.

    ALTER DATABASE fg_testing ADD FILEGROUP FG2;
    GO
    SELECT *
    FROM sys.filegroups;
    GO

    image

    Cool – data_space_id 2 is created.

    ALTER DATABASE fg_testing MODIFY FILEGROUP FG2 DEFAULT;

    image

    I think this is a shame – but I guess it’s fair enough: the command fails, because we have to have a file in the filegroup before we can make it the default. That’s easy enough though, and probably what we’d be wanting to do before too long anyway.

    ALTER DATABASE fg_testing ADD FILE ( NAME = N'file2', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2012\MSSQL\DATA\fg_testing_file2.ndf') TO FILEGROUP FG2;
    GO
    ALTER DATABASE fg_testing MODIFY FILEGROUP FG2 DEFAULT;

    image

    (I’m doing this on my laptop, which only has one physical drive – on a server it’d go somewhere more sensible of course)

    Now let’s add another filegroup. This is going to be for some data, but I don’t want it to be my default.

    ALTER DATABASE fg_testing ADD FILEGROUP FG3;
    GO
    ALTER DATABASE fg_testing ADD FILE ( NAME = N'file3', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2012\MSSQL\DATA\fg_testing_file3.ndf') TO FILEGROUP FG3;

    image

    Now, I want to create an object on FG3, which I’m going to pretend is my slower disk.

    Once created, I’m looking in sys.objects to confirm which filegroup the table is in, but there’s no information there. Remember that a table is only metadata, and the things that matter are the indexes/heaps that are on it. Checking in sys.indexes shows me that indeed, it’s in data_space_id 3.

    CREATE TABLE dbo.OrderDates (OrderDate date PRIMARY KEY, NumOrders int NOT NULL DEFAULT 0) ON FG3;
    GO
    SELECT *
    FROM sys.objects
    WHERE object_id = OBJECT_ID('dbo.OrderDates');
    GO
    SELECT *
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.OrderDates');
    GO

    image

    For completeness’ sake, I’m going to put some data in there, using a query that I blogged about yesterday.

    INSERT dbo.OrderDates (OrderDate, NumOrders)
    SELECT TOP (DATEDIFF(DAY,'20010101','20130312'))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    GO

    image

    But the whole point of this was to see what happens with the non-clustered index, which I want to be on recent data only (ie, filtered), and I want it to be in the DEFAULT filegroup.

    As I want it in the default group, I won’t specify a filegroup for the index.

    CREATE INDEX ixRecentData ON dbo.OrderDates (OrderDate)
    INCLUDE (NumOrders)
    WHERE OrderDate >= '20130101';
    GO
    SELECT *
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.OrderDates');

    image

    But look! It’s put the non-clustered index in the same filegroup as the clustered index. This isn’t what I wanted. In fact, it’s almost never what I’d want, because even if the disk performance is identical, it can be nice to have transactions which update both the clustered and non-clustered indexes using different disks.

    Instead, we have to specify it explicitly to tell it to use the filegroup we want.

    DROP INDEX ixRecentData ON dbo.OrderDates
    GO
    CREATE INDEX ixRecentData ON dbo.OrderDates (OrderDate)
    INCLUDE (NumOrders)
    WHERE OrderDate >= '20130101'
    ON FG2;
    GO
    SELECT *
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.OrderDates');

    image

    It’s a shame to have to be so explicit with these things, but when you create non-clustered indexes, you really should think about which filegroup they’re going onto, because they may well not go onto the default one as you might expect.
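
    One way to check (my own addition, not from the original post) is to join sys.indexes to sys.filegroups, which shows the filegroup for each index on the table by name rather than by data_space_id:

    -- show which filegroup each index of the table ended up in
    SELECT i.name AS index_name, fg.name AS filegroup_name
    FROM sys.indexes AS i
    JOIN sys.filegroups AS fg ON fg.data_space_id = i.data_space_id
    WHERE i.object_id = OBJECT_ID('dbo.OrderDates');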

    @rob_farley

  • A TOP Query

    For the blog post that I’ll publish tomorrow, I wrote a query that I thought needed a blog post all of its own. This is that post.

    The query was this one. Its results aren’t that interesting, it’s just a list of dates with a random number between 0 and 99. Just some sample data that I thought I’d use.

    SELECT TOP (DATEDIFF(DAY,'20010101','20130312'))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1));

    image

    So, let me give some background...

    When teaching about T-SQL, I often point out that a nums table is tremendously useful. One of its uses is to make a dates table, which can be really handy in a bunch of other ways, like in data warehouses. The idea is simple, assuming you have a table of numbers which starts from either 0 or 1 (I don’t really care, although I start mine from 1). Here I figure that you want to get dates from 20010101 up to (but not including) today.

    SELECT DATEADD(day, num-1, '20010101')
    FROM dbo.nums
    WHERE num <= DATEDIFF(day, '20010101', SYSDATETIME());

    I’m not going to show you the results, I figure that it’s easy for you to picture a list of dates.

    Oh, alright then.

    image

    Anyway, moving on.

    In this case, I didn’t have a nums table handy, and for that, I tend to use ROW_NUMBER() and the table master.dbo.spt_values. This table contains a bunch of rows of handy reference data. I wasn’t interested in the contents though, I just wanted to have enough rows, and it’s quick to reference that rather than writing a bunch of self-joining CTEs. There are over 2000 rows (2515 to be exact) in master.dbo.spt_values, and if I want to have up to 4 million, I just do a cross join to itself.

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2;

    image

    This query gives me a lot of rows, of course... but if I use it in a sub-query (or CTE) and filter it, then the simplification work that the Query Optimizer does will mean that it doesn’t try to work out all 4 million rows for me, it’ll stop when it’s seen enough.

    As an example, I can use this in conjunction with my earlier query, which had a WHERE clause.

    WITH nums AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day, num-1, '20010101')
    FROM nums
    WHERE num <= DATEDIFF(day, '20010101', SYSDATETIME());

    This produces my rows in less than a second, giving the same results as when we had a physical nums table.

    To include my random values, I’m using the commonly found randomizing method of ABS(CHECKSUM(NEWID())) % 100. RAND() is no good, as it just produces the same value for every row. NEWID() is much better, but it’s not a number. CHECKSUM solves that, but can be negative. ABS deals with the negatives, leaving a random number in a large range, and the mod 100 brings it down to the 0-99 range I’m after.

    WITH nums AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day, num-1, '20010101') AS OrderDate,
           ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM nums
    WHERE num <= DATEDIFF(day, '20010101', SYSDATETIME());

    image

    This works nicely, and is a common pattern for dealing with this kind of thing.

    But it’s not the query at the top of this post. That was done without a CTE, and used TOP instead, ordering by the ROW_NUMBER().

    Let’s have a look at it.

    SELECT TOP (DATEDIFF(DAY,'20010101','20130312'))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1));

    We can see that the FROM clause is the same as in our nums CTE. And the same randomising bit for NumOrders is in there.

    But instead of using ROW_NUMBER() OVER (ORDER BY (SELECT 1)) to define a num column, we’re using it within another function? Can we do that? Yes. ROW_NUMBER() can only be used in the SELECT clause and in the ORDER BY clause, and there are restrictions about using it within aggregate functions as you might expect, but here we’re just using it as a value which changes with every row, and there’s really no problem at all.

    We don’t have a filter though. In our CTE version, we used a filter to make sure we weren’t using every row from our CROSS JOIN. Here, we’re using TOP instead. But not TOP with some number – TOP with a function in it! This has been possible for a while, and it even supports sub-queries that produce numbers, in the form (including the double-bracket): SELECT TOP ((SELECT SomeVal ... )) Col1, Col2...
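
    As a runnable illustration of that sub-query form (my own example; the table and the divide-by-ten are arbitrary):

    -- TOP taking a sub-query, with the double brackets mentioned above
    SELECT TOP ((SELECT COUNT(*) / 10 FROM Sales.SalesOrderHeader))
        SalesOrderID, OrderDate
    FROM Sales.SalesOrderHeader
    ORDER BY OrderDate;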

    TOP appears in this query because I needed to limit the number of rows coming back. I couldn’t use the WHERE clause, because I didn’t have anything to filter on. So I used TOP, and had to use a function in there.

    So let’s compare the plans.

    To be fair, I’ll use the fixed date in both queries (for now).

    image

    Oh, how disappointing! My TOP query is 79% of the batch, and the CTE version is 21%. Clearly my new one is 4 times as bad, and I should give up on it.

    Actually, if you run these two queries on your own machine, you’ll see the first one isn’t 4 times as bad at all. It’s actually FASTER than the second. Something’s going on, and we should find out what.

    The plans look remarkably similar. In fact, the second one is identical apart from an extra Filter operator. I don’t have a Filter in the TOP one, but I didn’t really expect it to make that much difference.

    Otherwise, the plans look pretty similar. They both use a Row Count Spool, have a Sequence Project to work out the ROW_NUMBER, and they both use a Top operator – even the second one which doesn’t use TOP.

    But you see, the Query Optimizer would have seen that I was filtering on a column that mapped to ROW_NUMBER(), and that I was doing a “less than” operation there. That’s like doing a TOP, and the Query Optimizer sees benefit in this. It doesn’t explain what’s going on with the “4 times worse” thing though.

    Let’s examine some of the numbers. Interestingly, we see that the Nested Loop operator expects to do almost no work in the second plan, and 27% in the first. There’s a warning on there – that’s just because I’m using a CROSS JOIN, and I’m okay with that.

    The word ‘expect’ was very important in that last paragraph. The percentages there are based on the work that is expected to be done. Let’s look at the properties of the two Nested Loop operators.

    image image

    Glancing down here, we see a lot of it is the same, but the Estimated Number of Rows in the first one is 4453 (which is correct), while the second one is only 100 (not quite so correct). The arrows on the upper side of the Nested Loops show the effect of this.

    image image

    The second one figures it’ll have seen all the rows it needs to see before it gets a second row from the first table, whereas the first one thinks it might need 1.77058 rows (being 4453/2515). No wonder the second query thinks it’s quicker.

    Let’s see where this 100 estimate comes from though. Perhaps it’s that filter?

    Based on the estimates of the rows going into the filter and coming out of it, it expects that the filter will see 100 rows and return just 30. This is bad too, but it’s not as bad as the 100 v 4453 impact.

    This guess actually comes from the way that the Query Optimizer has turned the WHERE clause into a TOP clause. It figures there’s some doubt there, and guesses that 100 rows is probably not a bad place to start. When we give an explicit value (even using the DATEDIFF function), it can figure out what’s there and use this value. The second query goes part way there and works out that the result of the DATEDIFF is 4453, but simply doesn’t apply it fully to the Top operator, leaving us with that guess.

    image image

    It’d be nice if it could tell that 4453 is never NULL, and simplify this out a bit more, but it simply doesn’t do this.

    To round the post off, let’s consider what happens if we’re using SYSDATETIME() instead of the constant.

    SELECT TOP (DATEDIFF(DAY,'20010101',SYSDATETIME()))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1));

    WITH nums AS
    (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day,num-1,'20010101') AS OrderDate, ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM nums
    WHERE num <= DATEDIFF(DAY,'20010101',SYSDATETIME());

    image

    Oh! Now the first one is simpler still, leaving out the Row Count Spool operator, and thinking it’s going to be cheaper than the second one. Having not trusted that figure before, does this mean the first one is actually worse? Well, we have an idea about where to look – the estimates on some of the arrows, particularly near the Top operator.

    image image

    Oh no! Our first query thinks there’s now only one row going to be coming through. How awful! (Oh, and the number 4452 is fine, because I’m actually running this on March 11th, not March 12th, it’s just that March 12th is T-SQL Tuesday this month, which is what I was writing the query for).

    If you run this on your machine, hopefully you saw something different. Really.

    You see, this problem has kinda been fixed, and if you enable the documented trace flag 4199, it should be better – for the first query at least.

    By turning on trace flag 4199, and telling it to ignore what’s in the cache, it will evaluate SYSDATETIME() for that particular execution, and therefore come up with the right value for the Top operator. It doesn’t fix the WHERE clause version, but it does solve the TOP clause version.

    SELECT TOP (DATEDIFF(DAY,'20010101',SYSDATETIME()))
        DATEADD(day,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20010101') AS OrderDate,
        ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    OPTION (QUERYTRACEON 4199, RECOMPILE);

    WITH nums AS
    (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS num
    FROM master.dbo.spt_values t1 CROSS JOIN master.dbo.spt_values t2
    )
    SELECT DATEADD(day,num-1,'20010101') AS OrderDate, ABS(CHECKSUM(NEWID())) % 100 as NumOrders
    FROM nums
    WHERE num <= DATEDIFF(DAY,'20010101',SYSDATETIME())
    OPTION (QUERYTRACEON 4199, RECOMPILE);

    image

    The reason I say this shouldn’t’ve happened on your system is that you’re probably using trace flag 4199 as a start-up parameter.

    So there you have it... a query which might seem strange at first glance, but is actually a really nice alternative. Don’t be afraid to use expressions in your TOP clause – it’s a very powerful mechanism, and TOP is a great operator to see in your plans (because they will ask for fewer rows from your Scans and Seeks – remember how execution plans suck?). As is often the case, we’re tripped up by estimates being out, but if you can see what’s actually going on, you should be able to make good decisions about how to form your queries.

    @rob_farley

  • Why should you bother with the PASS BA Conference this April?

    I mean really? Why should you spend some of your training budget to go to this thing?

    Suppose you’re someone in the PASS Community who mainly looks after people’s data. That could involve database administration, performance tuning, helping developers write queries, that kind of thing. What part of “Advanced Analytics and Insights”, “Big Data Innovations and Integration”, “Data Analytics and Visualization”, “Information Delivery and Collaboration” or “Strategy and Architecture” is relevant to you? It sounds all well and good for the BI team, who’s thinking about cubes and models and report subscriptions and Power something, but that’s not you.

    The problem is that as data professionals, we’re no longer just database administrators. The world has become more demanding than that. Maybe it’s because of the financial difficulties that the western world has been facing. Maybe it’s because we’ve outgrown our jobs as database administrators. Maybe we’re simply being asked for more than we were before.

    Now more than ever before, if you’re a data professional, you need to be thinking about more than just transaction logs, corruption checking, and backups. You need to be thinking about the overall data story. You can tune your databases to cope with the large amount of data that’s pouring into them, as more and more systems produce consumable data. You can work with your developers to help them understand the significance of indexes in getting the data out faster. But is this really enough?

    Today, we need to be strategic about the data. An increasing number of companies are moving their data to the cloud, where the need for database administrators is not quite the same as it has been in the past. There are a number of tools out there to allow you to manage hundreds, even thousands of database servers, putting pressure on you to be providing more from your role.

    And then you get asked into meetings! People more senior than you asking about what can be done with the data. Can you offer more than just a comment about how much they can trust you to make sure their data is available?

    This is why you need to be looking at things like the Business Analytics Conference. It’s because you need to know how to make the data that you look after more relevant to the organisation who entrusts you with it. You need to know how to get insight from that data. You need to know how to visualise it effectively. You need to know how to make it visible through portals such as SharePoint.

    And you need to know WHY these things are important.

    Either that, or you need to call in external consultants, who can provide these kind of services. You know how to get in touch. ;)

    @rob_farley

    PS: I should mention that I’m on the PASS board, so I see a lot of stuff about this conference. I’m not part of the organising committee at all though, and have been remarkably separate from the whole process. I do consider that this conference is about helping people achieve more within the data space, and that’s something I think more people should be taking advantage of.

  • Behind the scenes of PowerShell and SQL

    Every year, PowerShell increases its stranglehold on the Windows Server system and the applications that run upon it – with good reason too. Its consistent mechanisms for interaction between its scripting interface and the underlying systems make it easy for people to feel comfortable, and there is a discoverability that has been lacking in many other scripting environments.

    Of course, SQL Server hasn’t been overlooked at all, and it’s coming up to five years since the SnapIns were made available (even longer since people started to dabble with SQL using PowerShell).

    But what’s going on behind the scenes? Does PowerShell present a threat to those amongst us who will always prefer T-SQL? Does PowerShell give us new options that are not available any other way? Well, let’s have a bit of a look, especially since this month’s T-SQL Tuesday (hosted by Wayne Sheffield who tweets as @DBAWayne) is on the topic of PowerShell.

    [T-SQL Tuesday logo]

    So we know PowerShell is useful. However we spin it up, we can quickly jump into writing commands, whether it be interacting with WMI, hooking into some .Net assembly we’ve loaded up, or simply browsing the file system. I’ve developed a tendency to use it to start whichever SQL instances I’m needing for the day – by default I have all of them turned off, since I don’t know which one I’ll be wanting most.

    If we’re going to be interacting with SQL, then it’s easiest to either load up the SQLPS environment directly (there’s a PowerShell shortcut within Management Studio), or else (as I do), start a PowerShell window with the Snapin loaded. I prefer this latter option, as the SQLPS environment is a slightly cut-back version of PowerShell. But either way – the stuff I’ll continue on with is essentially the same whichever environment you use.

    If you’ve talked about SQL with me long enough, you’ll have come across the fact that I often use SQL Profiler when I’m curious about where to find information. My former colleague Jim McLeod (@Jim_McLeod) blogged about this a few months ago, with an example that I remember looking through with him four or five years ago. It’s a great technique that works on all kinds of things, even across different versions of SQL Server. It also acts as a terrific reminder that Management Studio is not a special application, it simply knows how to ask for the pieces of information that it shows.

    But PowerShell (or SMO, for that matter), that’s in the .Net world. Surely that would be able to bypass the clunky T-SQL stuff that Management Studio does... I mean, Management Studio has to be flexible enough to work across remote servers, talking through firewalls that only allow T-SQL interfaces. Surely PowerShell has access to a deeper magic.

    Well, no. PowerShell still lets you talk to remote servers, and ends up using the same methods.

    Let’s prove it.

    Spin up Profiler, and start a trace against your favourite instance. I like to watch for SQL:BatchCompleted, SP:StmtCompleted and RPC:Completed events when doing this kind of thing. I’m using an instance that isn’t doing anything else, but you could apply a ColumnFilter to filter the events to things with an ApplicationName starting with SQLPS if you prefer.

    With that running, I jump into PowerShell and do something like:

    PS SQLSERVER:\sql\localhost\sql2008r2> dir Databases | ft name

    This lists the names of the databases on my SQL2008R2 instance. You don’t need to see the results, you can imagine them for yourself.

    If PowerShell were using some secret interface, it’s unlikely we’d see something in Profiler. But it’s not, and we see a bunch of stuff.

    image

    We see a bunch of Remote Procedure Calls, each with a Stored Procedure Statement Completed event showing the same information. And look – we see queries against master.sys.databases, asking for the name of each of the databases, passing in the name as a parameter. Brilliant! Notice just a bit earlier though, there’s a SQL:BatchCompleted call. This means that a query has been passed in directly. It’s this:

    SELECT
    CAST(
            serverproperty(N'Servername')
           AS sysname) AS [Server_Name],
    dtb.name AS [Name]
    FROM
    master.sys.databases AS dtb
    WHERE
    (CAST(case when dtb.name in ('master','model','msdb','tempdb') then 1 else dtb.is_distributor end AS bit)=0)
    ORDER BY
    [Name] ASC

    image

    So it grabs the list of database names first, and then makes extra calls to be able to fetch the list of names again, one by one.

    The reason why it’s grabbing the list of names one by one isn’t because it’s stupid and is asking to be ridiculed. It’s because we've asked to see that property, and I guess the PowerShell people figured that no matter what property you ask for, it’ll go and fetch it to show you.

    When I asked for the CompatibilityLevel property instead, I got some different rows thrown in. Interestingly though, it still asked for the name each time.

    image

    Also interestingly, when I asked for the CompatibilityLevel a subsequent time, the calls for “SELECT dtb.compatibility_level…” weren’t in there. They’d been cached by the PowerShell environment – important to note if you ever come across PowerShell giving you old values.

    So what about asking something more interesting? Let’s try asking about the IndexSpaceUsage in AdventureWorks.

    PS SQLSERVER:\sql\localhost\sql2008r2> gi Databases\AdventureWorks | ft IndexSpaceUsage

    The result tells me it’s 62576. Yeah, but today I’m not interested in that, just what happened in the trace.

    Four entries. An SP:StmtCompleted with matching RPC:Completed, and two SQL:BatchCompleted.

    The SP:StmtCompleted and RPC:Completed were this statement, passing in the parameter value ‘AdventureWorks’. Clearly incredibly informative.

    SELECT
    CAST(0 AS float) AS [IndexSpaceUsage],
    dtb.name AS [DatabaseName]
    FROM
    master.sys.databases AS dtb
    WHERE
    (dtb.name=@_msparam_0)

    This is getting the value zero. Wow. Brilliant stuff.

    The last entry – the second of the two SQL:BatchCompleted events – is:

    select convert(float,low/1024.) from master.dbo.spt_values where number = 1 and type = 'E'

    If you run this in Management Studio, you’ll discover it gives the value 8. Ok.

    The other entry is more interesting.

    use [AdventureWorks]
    SELECT
    SUM(CASE WHEN a.type <> 1 THEN a.used_pages WHEN p.index_id < 2 THEN a.data_pages ELSE 0 END) AS [DataSpaceUsage],
    SUM(a.used_pages) AS [IndexSpaceTotal]
    FROM
    sys.allocation_units AS a INNER JOIN sys.partitions AS p ON (a.type = 2 AND p.partition_id = a.container_id) OR (a.type IN (1,3) AND p.hobt_id = a.container_id)

    This is more like it! We run this in Management Studio, and we see two values. DataSpaceUsage is 13682, IndexSpaceTotal is 21504. Neither is our value of 62576. But we do have clues in the column names, and in that value 8 that came back too. We can easily deduce that it’s actually (IndexSpaceTotal-DataSpaceUsage)*8, and we have ourselves a nice little method for working out the IndexSpaceUsage ourselves now if we need it.
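
    If you wanted to do that calculation yourself in T-SQL, a minimal sketch that combines the two queries from the trace (same AdventureWorks database assumed) should hand back the same 62576:

    USE AdventureWorks;
    GO
    -- The 8 is the page size in KB, as returned by the spt_values query above
    SELECT
        ( SUM(a.used_pages)
          - SUM(CASE WHEN a.type <> 1 THEN a.used_pages
                     WHEN p.index_id < 2 THEN a.data_pages
                     ELSE 0 END)
        ) * 8 AS [IndexSpaceUsage]
    FROM sys.allocation_units AS a
    INNER JOIN sys.partitions AS p
        ON (a.type = 2 AND p.partition_id = a.container_id)
        OR (a.type IN (1,3) AND p.hobt_id = a.container_id);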

    Or we can just ask PowerShell next time as well.

    Incidentally – if you’re considering doing the MCM Lab exam some time, then you might find that a familiarity with PowerShell comes in really handy. I’m not saying there are PowerShell questions on the exam at all – I’m just suggesting that you may find that PowerShell becomes a really useful way of getting at some of the information that you’re looking for. If you’re stumbling around the list of DMVs trying to remember which one it is that stores some particular thing, remember that you might be able to get the data out more easily if you use PowerShell instead.

    So can we discover secret things about SQL from PowerShell? Are there things we can do in PowerShell that are impossible through other mechanisms? Hooks that let us break the rules even?

    Recently, Kendal van Dyke asked a question about this kind of thing on Twitter. He was wondering if you could have a default constraint on a column in a view, because he’d seen a property on a view column in PowerShell that suggested it might be possible. The answer is no though, and there’s a simple reason.

    PowerShell is a programmatic interface. It involves classes and properties and methods. It does things row by row, which is why much of what you see in that trace feels amazingly pedantic – asking about things which shouldn’t have to be that complicated. The implication of this though, is that PowerShell reuses the concept of a column, regardless of whether this is a column in a table, a view, or anywhere else it decides to need a column. The fact that columns in tables have some extra properties isn’t enough to make this class re-use pointless. If we try to set a Default constraint for a column in a view though, we get an error, just like if we tried to do it any other way.

    The PowerShell I used was:

    $db = Get-Item SQLSERVER:\sql\localhost\sql2008r2\Databases\AdventureWorks
    # Create a Default object in the database and give it a body
    $def = New-Object -TypeName Microsoft.SqlServer.Management.SMO.Default -ArgumentList $db, "ViewDefault"
    $def.TextHeader = "CREATE DEFAULT ViewDefault AS"
    $def.TextBody = "'ABC'"
    $def.Create()
    # Now try to bind it to a column of a view - this is the part that fails
    $def.BindToColumn("vStateProvinceCountryRegion","StateProvinceCode","Person")

    The code that ended up being called was a call to the stored procedure sp_bindefault (despite it being deprecated). Naturally, trying to execute this against a view column gives an error, regardless of what wrappers you have put around it – PowerShell or not.

    [image: the error returned when trying to bind the default to a view column]

    So PowerShell is very useful, and it provides a handy way of getting to a lot of things that could be otherwise hard. But looking below the surface, it isn’t able to circumvent the basic restrictions of SQL Server, because it still ends up doing its work using T-SQL.

    @rob_farley

  • Running goals

    I’m not big on New Year resolutions. I can’t say I respond particularly well to “everyone’s doing something, I should too” situations. Peer pressure can be useful at times, but I also find that it can make me even more stubborn.

    So when Jason Brimhall chose the topic for this month’s T-SQL Tuesday as “standing firm”, considering the ideas of “resolve, resolution, resolute”, I scowled a little, and wondered what I’d write about. I considered writing about my MCM journey (although that’s over for a few months), or thinking about some business-related thing – but the closest I have to a resolution this year is related to running. I get that it’s not about SQL, but it seems very relevant to various members of the SQL community. You know who you are – don’t make me list you.

    Many of you won’t know that this May (2013) will mark ten years since I had a nasty back injury. I don’t have a dramatic story to tell about it – it happened at work at the water cooler. I simply bent over wrong, and years of having a job where I sit down a lot, cycling (I used to cycle to work most days, every day when I was living in London), and being a little taller than most finally took its toll. Suddenly there was pain, and I ended up in hospital being told I might not get to walk again, and to expect to be in there until some time in July.

    Prayer got me out in ten days.

    But there were things I lost. I’ve never since been able to pick up my sons (they were five and three at the time). I’ve never been able to ride a bike (although I haven’t forgotten how). Until 2012, I’d been unable to run. I have chronic pain, and I’ve got used to it (which is annoying, but I’ve learned to tolerate it. That’s not always a good thing, like the time I had a badly inflamed appendix and nearly got sent home – although my pain tolerance did mean I could fly to Seattle a few days after the operation).

    In 2011, I started to walk more. I walked about eight miles one time just to see if I could. The pain level was higher, as expected, but I managed it. Later that year, I walked a half marathon in Portland, Oregon, just before the PASS Summit. I even beat one of the people in the group who was supposedly running (I felt bad and wanted to wait for him, but I think if I’d stopped I wouldn’t’ve been able to finish).

    I kept walking, but in June 2012, I talked to my physio about getting back to running. He’d supported me through the walking, and knew that it wasn’t causing me too much bother any more. He got me on the treadmill, then onto grass, then onto the road. It hurt more, but I was surviving. Annoyingly, a cough I had developed into something more, and I ended up being sick for the best part of June to August, but I was running when I could, and I was surviving.

    In November, I didn’t just walk a half marathon, I ran it. Annoyingly I had tendinitis in my right knee, which meant I had acute pain almost the whole way around the San Francisco course. I didn’t stop running though, and although my pace dropped so much I could’ve walked most of it faster, I got to the end. Last month, December 2012, I agreed to do at least 15 minutes of cardio every day, and I managed it. I ran over 75 miles during the month, and I’m sure my fitness level is improving.

    So what’s in store for 2013 on the running scene?

    Well, I want to get a better time in a half marathon. That shouldn’t be hard – I just need to run one when I’m not injured.

    I wouldn’t mind trying a full marathon. I’ve joined the South Australian Road Runners’ Club, and there are some marathons I could aim for during the year. There are also some trips to the US that I’ll be doing, and I could try to do one over there.

    I want to keep going with doing some cardio every day. My weight is dropping and I would like to get back to close to what I was before my injury. I don’t expect to get under it – I’m not in my 20s any more – but I would like to get close. In the next few weeks I should get under 200lb again (90.7kg), and would like to be able to reach 86kg. The goal isn’t just to lose weight, but rather to keep increasing the fitness level and see my back improve. Too many times in the past ten years I’ve had such bad times with my back that I’ve been unable to stand straight, or walk more than a few paces without stopping and holding onto something. On those days, I wouldn’t’ve been able to do any cardio at all, and I praise God massively for the quality of life I have when those times aren’t haunting me.

    I don’t know if I can manage to run a particular distance in 2013. If I could average the same as December, that would get me to 900 miles for the year. That should mean that 1000 could be possible, but 750 miles could be a more realistic goal. I’d still be thrilled to be able to achieve that.

    So if you follow me on Twitter and see one of these tweets, feel free to hit reply and send me a message. While I’m running, every tweet that starts with @rob_farley will get read out to me by the Runmeter app on my phone. That’s really encouraging, and it helps me know that I have supporters. Thanks in advance, because you’ll be helping me be more resolute.

    @rob_farley

    PS: If you’ve read all the way down to here, go and donate some money to the Leukaemia & Lymphoma Society, supporting Yanni Robel who will run 39.3 miles this weekend. I can’t see myself trying that particular feat, but two years ago, Yanni wasn’t a runner either. Both she and her husband John are good friends of mine who have encouraged me immensely. So go donate some money now and help her reach her fundraising goal too. http://pages.teamintraining.org/wa/wdw13/yrobel

  • MCM – I passed!

    I was wrong – I passed the MCM lab last month.

    I know I thought I’d failed – I’m still surprised that I passed. Maybe I managed to satisfy enough of the questions well enough. Certainly none of the questions were beyond me, as I wrote just after I’d sat it. But I do know that I left more questions in an incomplete state than I would’ve liked.

    The things I hadn’t finished were mostly those things which I’d simply been caught by from a time-perspective. I didn’t feel like I was out of my depth at all, just that some things hadn’t worked the way that I’d expected. In real life I would’ve been able to solve them, and given a little more time, I would’ve been able to get through them too.

    It’s why I wrote that I was confident I’d pass it on a subsequent attempt, although I’m quite happy to not have to do that. I’m still happy I told everyone I was taking the exams, and would do the same again. The study aspect is still something I’m not sure about. As much as I’d like to advise people to NOT study, I get that most people don’t manage to pass first time (even with some study). Heck – it used to require three weeks of intensive training in Seattle, and I was never going to do that (at least, not unless I was on the other side of the desk).

    So now I’m a Microsoft Certified Master in SQL Server, to go with the Microsoft Certified Trainer cert and Microsoft Most Valuable Professional award that I’ve had since 2006. I don’t know how many people around the world have both MCM and MCT in SQL Server, but there can’t be many. I need to deliver more training.

    [image: MCM, MVP and MCT logos]

    @rob_farley
    LobsterPot Solutions

  • The SQL Query Optimizer – when Logical Order can get it wrong

    It’s very easy to get in the habit of imagining the way that a query should work based on the Logical Order of query processing – the idea that the FROM clause gets evaluated first, followed by the WHERE clause, GROUP BY, and so on – finally ending with whatever is in the SELECT clause. We even get in the habit of creating indexes that focus on the WHERE clause, and this is mostly right.

    But it’s only mostly right, and it will often depend on statistics.

    There are other situations where statistics have to play a major part in choosing the right plan, of course. In fact, almost every query you ever run will use statistics to work out the best plan. What I’m going to show you in this post is an example of how the statistics end up being incredibly vital in choosing the right plan. It also helps demonstrate an important feature of the way that Scans work, and how to read execution plans.

    I’m going to use AdventureWorks2012 for this example. I’m going to ask for the cheapest product according to the first letter of the product name. This kind of query:

    SELECT MIN(ListPrice)
    FROM Production.Product
    WHERE Name LIKE 'H%';

    Don’t run it yet. I want to ask you how you’d solve it on paper.

    Would you prefer I give you a list of the products sorted by name, or would you prefer I give you that list sorted by price?

    If you want the ‘sorted by name’ option, then you’ll have to look through all the products that start with H, and work out which is the cheapest (notice that my predicate is not an equality predicate – if I knew what the name had to be exactly, then I could have an index which ordered by name and then price, and very quickly find the cheapest with that name). This approach could be good if you don’t have many products starting with that particular letter. But if you have lots, then finding them all and then looking for the cheapest of them could feel like too much work. Funnily enough, this is the way that most people would imagine this query being run – applying the WHERE clause first, and applying the aggregate function after that.

    On the other hand, if you have lots of products with that particular letter, you might be better off with your list sorted by price, looking through for the first product that starts with the right letter.

    Let me explain this algorithm a little more.

    If you’re at a restaurant and are strapped for cash, you might want to see what the cheapest thing is. You’d pick the “sorted by price” menu, and go to the first item. But then if you saw it had peanut in, and you have an allergy, then you’d skip it and go to the next one. You wouldn’t expect to have to look far to find one that doesn’t have peanut, and because you’ve got the “sorted by price” menu, you have the cheapest one that satisfies your condition after looking through just a few records.

    It’s clearly not the same algorithm as finding all the things that satisfy the condition first, but it’s just as valid. If you’re only going to have to look through a handful of products before you find one that starts with the right letter, then great! But what if there are none? You’d end up having to look through the whole list before you realised.

    The Query Optimizer faces the same dilemma, but luckily it might have statistics, so it should be able to know which will suit better.

    Let’s create the two indexes – one sorted by Name, one sorted by Price. Both will include the other column, so that the query will only need one of them.

    CREATE INDEX ixNamePrice ON Production.Product (Name) INCLUDE (ListPrice);
    CREATE INDEX ixPriceName ON Production.Product (ListPrice) INCLUDE (Name);

    Now let’s consider two queries. Both queries give the same result – $0.00. But that’s not important, I’m only interested in how they run.

    SELECT MIN(ListPrice)
    FROM Production.Product
    WHERE Name LIKE 'I%';

    SELECT MIN(ListPrice)
    FROM Production.Product
    WHERE Name LIKE 'H%';

    The two queries are almost identical, but they run quite differently.

    [image: the two execution plans – one using an Index Seek, the other an Index Scan]

    Ok, they’re fairly similar – they both use a Stream Aggregate operator, for example. And they have similar cost. But significantly, one is performing a Seek, while the other is doing a Scan. Different indexes, but nevertheless a Scan and a Seek.

    People will tell you that Scans are bad and Seeks are good, but it’s not necessarily the case. Here, we see that the Scan plan is no more expensive than the Seek plan – it’s just different. We should consider why.

    Those two indexes are the two different stories that I described earlier. There are very few products that start with the letter ‘I’, and quite a number that start with ‘H’, and so the Query Optimizer has chosen differently.

    There are exactly 10 products that start with I. From a total of 504. That’s less than 2% of the products.

    There are 91 products that start with H. That’s 18%. You might not have expected it to be that high, but that’s okay – if SQL has been maintaining statistics for you on this, it hopefully won’t be as surprised as you.
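
    If you want to check those counts for yourself, a query along these lines against AdventureWorks2012 shows the distribution of first letters (the column aliases are just mine):

    SELECT LEFT(Name, 1) AS FirstLetter,
           COUNT(*) AS NumProducts,
           CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS decimal(5,1)) AS PercentOfProducts
    FROM Production.Product
    GROUP BY LEFT(Name, 1)
    ORDER BY FirstLetter;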

    18% – nearly 1 in 5. So by the time you’ve looked at, oh, a dozen records, you will have almost certainly found one that starts with an H. (Actually, the chance of NOT finding one in the first 12 would be power(.82, 12), which is 0.09. That’s just 9%.) If I do a bit of digging into the internals, I can discover that the pages in my index typically have over a hundred records on them each. The chance of not finding a product that starts with an H on that first page – you’d need lottery-scale luck (1 in 444 million).
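
    The arithmetic is easy enough to check in T-SQL too. Note the float literals – POWER returns the type of its first argument, so a plain decimal 0.82 would see the tiny second value rounded away to zero:

    SELECT POWER(0.82e0, 12)  AS ChanceOfNoHInFirst12Rows,  -- roughly 0.09
           POWER(0.82e0, 100) AS ChanceOfNoHOnFirstPage;    -- vanishingly small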

    On the other hand, the cost of finding the cheapest value from 91 records is a lot more expensive than finding the cheapest from just 10. And getting all 10 records should be a small number of reads too.

    But a Scan! Really? It has to look through the whole table, right?

    No. That’s not how it works.

    You see, execution plans go from left to right. If you start reading these plans from the right, you’ll start thinking that the whole index has been scanned, when it’s simply not the case. That Top operator asks for a single row from the index, and that’s all it provides. Once that row has been found, the Scan operation stops.

    For this information, I don’t even need to pull up the Properties window for the Scan (but I would recommend you get in the habit of doing that). No – this is all available in the Tool Tip. Look at the “Actual Number of Rows” value – it’s just one.

    [image: the Scan operator’s tooltip, showing Actual Number of Rows as 1]

    A predicate is applied – it looks through the index for rows that start with H – but it’s doing this in Order (see Ordered = True), and it’s stopping after the first row is found. Remember I mentioned that there are actually 91 rows that satisfy the predicate? The Scan doesn’t care – it only needs one and it stops right then.

    You might figure this is because we are using MIN. What if we needed the MAX though? Well, that’s just the same, except that the Direction of the Scan is BACKWARD (you’ll need F4 for that one).

    [image: the Scan operator’s properties, showing Scan Direction = BACKWARD]

    MIN goes forward, because it’s most interested in the ‘smallest’ ones, while MAX goes backward because it wants the ‘largest’. (And as you’d probably expect, if you’d created your index to be descending, then it would be reversed.)
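
    For reference, the MAX flavour of the query – the one whose plan shows the BACKWARD scan direction – is simply:

    SELECT MAX(ListPrice)
    FROM Production.Product
    WHERE Name LIKE 'H%';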

    But again – being able to tell which is the better algorithm depends entirely on having statistics that reflect your data.

    I see so many systems with bad statistics for one reason or another, typically because the data most frequently queried is the newest data, which makes up only a small percentage of the table. The statistics will suggest that there is almost no data for ‘today’, because they probably haven’t been updated for at least some hours.
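
    If you’re wondering how stale your own statistics are, a quick sketch using sys.stats and STATS_DATE (filtering on Production.Product here, but any table works) is:

    SELECT s.name AS StatName,
           STATS_DATE(s.object_id, s.stats_id) AS LastUpdated
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID('Production.Product')
    ORDER BY LastUpdated;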

    When you look at how a query is running, always have a think about how you’d solve it on paper, and remember that you might actually have a better (or worse) picture of the statistics than the Query Optimizer has.

    And remember that a Scan is not necessarily bad. I might do another post on that soon as well.

    @rob_farley

  • Portfolio changeover time for the PASS board

    The last PASS board meeting for the year has happened, and the portfolio handovers are well under way.

    Sadly, having new board members elected means having existing board members step down, and this last board meeting was the last one for both Allen Kinsel (@sqlinsaneo) and Kendal van Dyke (@sqldba). In 2012, these guys had the portfolios of local chapters and SQL Saturdays, respectively.

    Newly elected board member Wendy Pastrick (@wendy_dance) is taking over from Allen on local chapters, while I’m taking over SQL Saturdays from Kendal. In 2012, my portfolio was 24 Hours of PASS, which is being rolled into the Virtual Chapters portfolio, headed still by Denise McInerney (@denisemc06).

    I have to admit that I’m really excited that the 24HOP portfolio is being merged with Virtual Chapters, as the two are so linked. I had been on the 24HOP committee before I joined the PASS board, and had recommended that the two portfolios be merged around the time I was elected to the board. During my term I even recruited Virtual Chapter leaders to be on the committee for 24HOP, as I believe their experience with online events makes them best suited to influence PASS’ premier online event – the semi-annual 24HOP.

    2012 was a good year for 24HOP, although it was the riskiest for some time as well.

    Two of the more obvious changes that we made were to look at a new platform, and to return to the 24-hours straight format (rather than two 12-hour blocks). This more continuous format meant that numbers dropped (the largest audience is in the US, so any sessions that are overnight for the US are obviously going to have smaller attendance). However, this format meant we reached over 100 different countries, which I think was really significant. Comparing the first 2012 event with the first 2011 event (which used the 2x12 format), we jumped from reaching 54 countries in 2011 to 104 in 2012.

    While I was still on the committee, we had discussed the need for a new platform, as the LiveMeeting platform wasn’t coping well with the numbers we were seeing. A number of options had been considered, some too expensive, some not capable of scaling sufficiently, and a decision had been made to use a platform called IBTalk. It was obviously more expensive than LiveMeeting (which had been available for free), but looked like it was going to scale much more nicely. We used it for both 2012 events and it will also be used for the next event (on Jan 30). The decision to use IBTalk was very risky, but as an experiment it seemed to work okay. There were both good and bad elements of the platform, which I’m not going to go into in a forum like this, although the second event that we used IBTalk for ended up being much smoother than the first, and I anticipate that the Jan 30 event will be smoother still.

    I felt like the first event of 2012 was dominated by the new platform. It was held two weeks after the SQL Server 2012 launch, which had also been a large virtual event using a new platform. I guess experimenting with new platforms was a large topic of discussion that month. One thing that didn’t really work for us was the closed captioning. It turns out that when you have someone providing closed captioning live, any typos that come through, or anything that is misheard by the person providing the service, doesn’t make for a good feed into a translation service. We tried, and it was good to try – but it didn’t work so well. Despite all that, PASS members can view the session recordings at http://www.sqlpass.org/24hours/spring2012/Home.aspx

    The main 24HOP event in the second half of the year was the annual Summit Preview event. We didn’t try to pursue the closed captioning again, but we did continue with IBTalk. Going back to LiveMeeting was never going to be an option for us, and we wanted to take a second look at the platform, in light of the various things we’d learned from the experience in Q1. It was a better experience from a number of perspectives, and we certainly got to test the scalability.

    Over the course of the day, we had good numbers – only a handful shy of 10,000 attendees (okay, a handful if you count fingers, toes, and were inbred – we had 9979). The lowest attendances were around the 100 mark, but the largest reached 1421 attendees. The highest from any previous event was around the 800 mark, so this was a significant improvement – and the platform handled it just fine. If we’d had that many people trying to access the LiveMeeting platform it simply wouldn’t’ve coped, and the IBTalk platform deserves praise for that.

    The platform decision isn’t over yet. A new search has taken place in light of what we’ve learned in the past year, and including a lot of what people have expressed to us on platforms such as Twitter. There are platforms that are way out of our price range (it can be very expensive to present 10,000 man-hours of content using some platforms), and there are ones that won’t cope with some of the things we would like to do. With some of the Virtual Chapters growing fast, a new platform needs to be able to cope with them too, with a wide variety of attendances needing to be handled. I wish Denise all the best for that, and have been able to happily assure her that the PASS HQ team that does most of the work behind the scenes for 24HOP (particularly Jane and Vicki) is excellent and will make her look brilliant this year.

    Another change in 2012 was the sponsorship story. For a long time, Dell had been a major sponsor of 24HOP, and I want to thank them for that. However, 24HOP wasn’t a priority for them in 2012, and new sponsors needed to be found. The first event saw sponsorship come in from Microsoft, SQL Sentry and Idera, with Idera being replaced by RSSBus for the second event. But what really excited me was to see a second tier of sponsors join the fray, with Melissa Data and Confio joining Idera as ‘Alliance Sponsors’. It was really good to have six fantastic companies sponsoring the event, and providing extra options for them.

    I haven’t even mentioned the non-English events that have taken place! PASS has seen 24HOP events in Russian, Portuguese and Spanish this year, although my personal involvement with those events has been somewhat less. Again, the PASS HQ staff have been great in enabling these events, and helping them run smoothly.

    So I leave 24HOP in capable hands.

    Instead, I pick up the SQL Saturday portfolio – another fast-growing facet of PASS. Already the 200th SQL Saturday event has been scheduled, and I’m finding myself getting onto a moving train. Luckily, I won’t be battling anyone on the roof Bond-style, but there are a lot of things that will need attention to make sure that the SQL Saturday model can continue to be successful.

    The PASS HQ staff most responsible for the SQL Saturdays that happen all around the world are Karla and Niko. If you’ve ever met either of these two, you’ll know that they run quickly and are nothing if not achievers. I suspect that I could just tell them to keep doing their thing and the portfolio would be considered successful. This is incredibly useful to me, because I should be able to focus on identifying and solving some of the things that might need to change as these events become larger in both size and number. I’m keen to look into some of the edge cases, such as international events (including non-English), and both the larger and smaller events that are around – but all the time trying to serve Niko, Karla and all the community leaders in what they do.

    @rob_farley

  • SQL Replication demystified

    I spent some time explaining SQL Server Replication to someone recently. They said they hadn’t ever really understood the concepts, and that I’d managed to help. It’s inspired me to write a post that I wouldn’t normally do – a “101” post. I’m not trying to do a fully comprehensive piece on replication, just enough to be able to help you get the concepts.

    The way I like to think about replication is by comparing it to magazines. The analogy only goes so far, but let’s see how we go.

    The things being replicated are articles. A publication (the responsibility of a publisher) is a collection of these articles. At the other end of the process are people with subscriptions. It’s just like when my son got a magazine subscription last Christmas. Every month, the latest set of articles got delivered to our house. [image: FourFourTwo magazine cover] (The image here isn’t my own – but feel free to click on it and subscribe to FourFourTwo – excellent magazine, particularly when they’re doing an article about the Arsenal.) Most of the work is done by agents, such as the newsagent that gets it to my house.

    In SQL Server, these same concepts hold. The objects which are being replicated are articles (typically tables, but also stored procedures, functions, view definitions, and even indexed views). You might not replicate your whole database – just the tables and other objects of interest. These articles make up a publication. Replication is just about getting that stuff to the Subscribers.

    Of course, the magazine analogy breaks down quite quickly. Each time my son got a new edition, the articles were brand new – material he’d never seen before. In SQL Replication, the Subscribers probably have data from earlier. But this brings us to look at a key concept in SQL Replication – how the stupid thing starts.

    Regardless of what kind of replication you’re talking about, the concept is all about keeping Subscribers in sync with the Publisher. You could have the whole table move across every time, but more than likely, you’re going to just have the changes go through. At some point, though, the thing has to get to a starting point.

    This starting point is (typically) done using a snapshot. It’s not a “Database Snapshot” like what you see in the Object Explorer of SQL Server Management Studio – this is just a starting point for replication. It’s a dump of all the data and metadata that make up the articles, and it’s stored on the file system. Not in a database, on the file system. A Subscriber will need this data to be initialised, ready for a stream of changes to be applied.

    It’s worth noting that there is a flavour of replication which just uses snapshots, known as Snapshot Replication. Every time the subscriber gets a refresh of data, it’s the whole publication that has to flow down. This might be fine for small amounts of data, but it might not be for others.

    (There are other ways to get started too, such as by restoring a backup, but you should still be familiar with the concept of snapshots for replication.)

    To get in sync, a subscriber would need the data in the snapshot for initialisation, and then every change that has happened since. To reduce the effort that would be required if something went drastically wrong and a new subscription became needed, snapshots can be recreated at regular intervals. This is done by the Snapshot Agent, and like all agents, can be found as a SQL Server Agent job.

    The middle-man between the Publisher and the Subscribers is the Distributor. The Distributor is essentially a bunch of configuration items (and as we’ll see later, changes to articles), stored in the distribution database – a system database that is often overlooked. If you query sys.databases on a SQL instance that has been configured as a Distributor you’ll see a row for the distribution database. It won’t have a database_id less than 5, but it will have a value of 1 in the is_distributor column. The instance used as the Distributor is the one whose SQL Server Agent runs most of the replication agents, including the Snapshot Agent.
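
    A minimal check you could run on an instance that has been configured as a Distributor:

    -- The distribution database shows up with is_distributor = 1
    SELECT name, database_id, is_distributor
    FROM sys.databases
    WHERE is_distributor = 1;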

    If you’re not doing Snapshot Replication, you’re going to want to get those changes through. Transactional Replication, as the name suggests, involves getting transactions that affect the published articles out to the subscribers. If the replication has been set up to push the data through, this should be quite low latency.

    So that SQL Server isn’t having to check every transaction right in the middle of it, there’s a separate agent that looks through the log for transactions that are needed for the replication, copying them across to the distribution database, where they hang around as long as they’re needed. This agent is the Log Reader Agent, and it also runs on the Distributor. You can imagine that there is a potential performance hit if this is running on a different machine to the Publisher, and this is one of the influencing factors that means that you’ll typically have the Distributor running on the Publisher (although there are various reasons why you might not).

    Now we have a process which is making sure that initialisation is possible by getting snapshots ready, and another process which is looking for changes to the articles. The agent that gets this data out to Subscribers is the Distribution Agent. Despite its name, it can run at the Subscriber, if the Subscriber is set to pull data across (good for occasionally connected systems). This is like with my magazine – I might prefer to go to the newsagent and pick it up, if I’m not likely to be home when the postman comes around. In effect, my role as Subscriber includes doing some distribution if I want to pull the data through myself.

    These three agents, Snapshot Agent, Log Reader Agent and Distribution Agent, make up the main agents used for Transactional Replication, which is probably the most common type of replication around. Snapshot Replication doesn’t use the Log Reader Agent, but still needs the other two.
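
    Because all of these agents run as SQL Server Agent jobs, one way to see which ones an instance is running is to look at the job categories in msdb. The 'REPL-%' pattern below is the usual naming for these categories, but treat this as a sketch rather than a definitive list:

    SELECT c.name AS CategoryName, j.name AS JobName
    FROM msdb.dbo.sysjobs AS j
    INNER JOIN msdb.dbo.syscategories AS c
        ON c.category_id = j.category_id
    WHERE c.name LIKE 'REPL-%'
    ORDER BY c.name, j.name;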

    Now let’s consider the other types of replication.

    Merge Replication involves having subscribers that can also change the data. It’s similar to Transactional Replication with Updateable Subscribers, which has been deprecated. These changes are sent back to the Merge Agent, which works out what changes have to be applied. This is actually more complicated than you might expect, because it’s very possible to have changes made in multiple places and for a conflict to arise. You can set defaults about who wins, and can override manually through the Replication Monitor (which is generally a useful tool for seeing if Subscribers are sufficiently in sync, testing the latency, and so on). Updateable Subscribers end up using the Queue Reader Agent instead of the Merge Agent. They’re slightly different in the way they run, but I consider them to be quite similar in function, as they both involve getting the data back into the publisher when changes have been made elsewhere.

    Peer-to-Peer Replication is the final kind. This is really a special type of Transactional Replication, in which you have multiple publishers, all pushing data out to each other. It’s the option that is considered closest to a High Availability system, and is good across geographically wide environments, particularly if connections are typically routed to the closest server. Consider the example of servers in the UK, the US and Australia. Australian users can be connected to the local server, knowing the changes are going to be pushed out to the UK and US boxes. They’re set up in a topology, with each server considered a node. Each server keeps track of which updates it’s had, which means they should be able to keep in sync, regardless of when they have downtime. If Australian changes are sent to the UK but not the US, then the US can be updated by the UK server if that’s easier.

    Replication can feel complex. There are a lot of concepts that are quite alien to most database administrators. However, the benefits of replication can be significant, and are worth taking advantage of in many situations. They’re an excellent way of keeping data in sync across a number of servers, without many of the server availability hassles associated with log-shipping or mirroring. It can definitely help you achieve scale-out environments, particularly if you consider Peer-to-Peer, which can help you offload your connections to other servers, knowing the key data can be kept up-to-date easily.

    I haven’t tried to be completely comprehensive in this quick overview of replication, but if you’re new to the concepts, or you’re studying for one of the MCITP exams and need to be able to get enough of an understanding to get you by, then I hope this has helped demystify it somewhat.

    There’s more in SQL Books Online, of course – a whole section on Replication. If what I’ve written makes sense, go exploring, and try to get it running on your own servers too. Maybe my next post will cover some of that.

    @rob_farley
