SQL Server is quite remarkable in a bunch of ways. In this post, I’m using the way that the Query Optimizer handles LIKE to keep it SARGable, the Execution Plans that result, Collations, and PowerShell to come up with the Swedish alphabet.
SARGability is the ability to seek for items in an index according to a particular set of criteria. If you don’t have SARGability in play, you need to scan the whole index (or table if you don’t have an index). For example, I can find myself in the phonebook easily, because it’s sorted by LastName and I can find Farley in there by moving to the Fs, and so on. I can’t find everyone in my suburb easily, because the phonebook isn’t sorted that way. I can’t even find people who have six letters in their last name, because also the book is sorted by LastName, it’s not sorted by LEN(LastName). This is all stuff I’ve looked at before, including in the talk I gave at SQLBits in October 2010.
If I try to find everyone who’s names start with F, I can do that using a query a bit like:
SELECT LastName FROM dbo.PhoneBook WHERE LEFT(LastName,1) = 'F';
Unfortunately, the Query Optimizer doesn’t realise that all the entries that satisfy LEFT(LastName,1) = 'F' will be together, and it has to scan the whole table to find them.
But if I write:
SELECT LastName FROM dbo.PhoneBook WHERE LastName LIKE 'F%';
then SQL is smart enough to understand this, and performs an Index Seek instead.
To see why, I look further into the plan, in particular, the properties of the Index Seek operator. The ToolTip shows me what I’m after:
You’ll see that it does a Seek to find any entries that are at least F, but not yet G. There’s an extra Predicate in there (a Residual Predicate if you like), which checks that each LastName is really LIKE F% – I suppose it doesn’t consider that the Seek Predicate is quite enough – but most of the benefit is seen by its working out the Seek Predicate, filtering to just the “at least F but not yet G” section of the data.
This got me curious though, particularly about where the G comes from, and whether I could leverage it to create the Swedish alphabet.
I know that in the Swedish language, there are three extra letters that appear at the end of the alphabet. One of them is ä that appears in the word Västerås. It turns out that Västerås is quite hard to find in an index when you’re looking it up in a Swedish map. I talked about this briefly in my five-minute talk on Collation from SQLPASS (the one which was slightly less than serious).
So by looking at the plan, I can work out what the next letter is in the alphabet of the collation used by the column. In other words, if my alphabet were Swedish, I’d be able to tell what the next letter after F is – just in case it’s not G.
It turns out it is… Yes, the Swedish letter after F is G. But I worked this out by using a copy of my PhoneBook table that used the Finnish_Swedish_CI_AI collation. I couldn’t find how the Query Optimizer calculates the G, and my friend Paul White (@SQL_Kiwi) tells me that it’s frustratingly internal to the QO. He’s particularly smart, even if he is from New Zealand.
To investigate further, I decided to do some PowerShell, leveraging the Get-SqlPlan function that I blogged about recently (make sure you also have the SqlServerCmdletSnapin100 snap-in added. To do that, run get-pssnapin -reg to see if it's on the machine, and get-pssnapin -reg | add-pssnapin to make the cmdlet available).
I started by indicating that I was going to use Finnish_Swedish_CI_AI as my collation of choice, and that I’d start whichever letter came straight after the number 9. I figure that this is a cheat’s way of guessing the first letter of the alphabet (but it doesn’t actually work in Unicode – luckily I’m using varchar not nvarchar. Actually, there are a few aspects of this code that only work using ASCII, so apologies if you were wanting to apply it to Greek, Japanese, etc). I also initialised my $alphabet variable.
$collation = 'Finnish_Swedish_CI_AI';
$firstletter = '9';
$alphabet = '';
Now I created the table for my test. A single field would do, and putting a Clustered Index on it would suffice for the Seeks. [Edit: You may need to insert some rows if you find that it's producing an Index Scan. Theoretically it should prefer to Scan, knowing there's no rows, but my tests showed it would always Seek]
Invoke-Sqlcmd -server . -data tempdb -query "create table dbo.collation_test (col varchar(10) collate $collation primary key);"
Now I get into the looping.
$c = $firstletter;
$stillgoing = $true;
I construct the query I want, seeking for entries which start with whatever $c has reached, and get the plan for it:
$query = "select col from dbo.collation_test where col like '$($c)%';";
[xml] $pl = get-sqlplan $query "." "tempdb";
At this point, my $pl variable is a scary piece of XML, representing the execution plan. A bit of hunting through it showed me that the EndRange element contained what I was after, and that if it contained NULL, then I was done.
$stillgoing = ($pl.ShowPlanXML.BatchSequence.Batch.Statements.StmtSimple.QueryPlan.RelOp.IndexScan.SeekPredicates.SeekPredicateNew.SeekKeys.EndRange -ne $null);
Now I could grab the value out of it (which came with apostrophes that needed stripping), and append that to my $alphabet variable.
$alphabet += $c;
Finally, finishing the loop, dropping the table, and showing my alphabet!
Invoke-Sqlcmd -server . -data tempdb -query "drop table dbo.collation_test;";
When I run all this, I see that the Swedish alphabet is ABCDEFGHIJKLMNOPQRSTUVXYZÅÄÖ, which matches what I see at Wikipedia. Interesting to see that the letters on the end are still there, even with Case Insensitivity. Turns out they’re not just “letters with accents”, they’re letters in their own right.
I’m sure you gave up reading long ago, and really aren’t that grabbed about the idea of doing this using PowerShell. I chose PowerShell because I’d already come up with an easy way of grabbing the estimated plan for a query, and PowerShell does allow for easy navigation of XML.
I find the most interesting aspect of this as the fact that the Query Optimizer uses the next letter of the alphabet to maintain the SARGability of LIKE. I’m hoping they do something similar for a whole bunch of operations.
Oh, and the fact that you know how to find stuff in the IKEA catalogue.
If you are interested in whether this works in other languages, you might want to consider the following screenshot, which shows that in principle, it should work with Japanese. It might be a bit harder to run this in PowerShell though, as I’m not sure how it translates. In Hiragana, the Japanese alphabet starts あ, ぃ, ぅ, ぇ, ぉ, ...