...but first, let’s look at one oft-forgotten reason why finding a particular piece of text can be slow, if not impossible: collation. This will then provide a useful platform for making it go faster.
I say ‘impossible’, but of course it’s never impossible to find something in a database (assuming it’s there). It might take longer, but you can always scan the column for it, starting on the first page and going until you’ve found it. Various things like Full Text Search can help make things easier, but all-too-frequently we see code that searches for SomeText%, or worse: %SomeText%. This is the thing I want to look at – finding non-indexed strings patterns.
First let’s remember that if we are hoping to use an index, we need to know what language we’re in. I have spoken before about how I picked up a map in Sweden to find Västerås, but couldn’t find it listed in the map’s index. I didn’t realise that in Swedish, ‘ä’ and ‘å’ are not the same as ‘a’, and found at a different part of the alphabet. When I searched using an English alphabet, I couldn’t find the entry. I might think that ‘Västerås’ and ‘Vasteras’ are the same, but a Swedish person would tell me otherwise. It’s like if I refer to a game as ‘football’, you would need to understand my personal collation setting to know what I was talking about. When Michael Palin sung (as a lumberjack) about wearing high-heels, suspenders and a bra, he wasn’t referring to anything that held his trousers up, despite what people using an American collation setting might think.
But this is about making searches for text go faster. If we’re comparing two strings in a different collation we get an error, but let’s think about speed.
Consider that I’m looking for rows in a table WHERE CommentText LIKE '%Farl%'. Right away, I’m sure you appreciate that no amount of regular indexing on CommentText would let me perform an ordinary b-tree index search to find that row. I could improve it by using other technologies that will allow the individual words in my text to be found, but I’m just looking for a particular piece of text. It’s not even a whole word.
For my experiment, I’m using a table on SQL 2014 on my Surface Pro 2. It’s a larger version of AdventureWorks2012’s Person.Address with 19 million rows. There is a column called AddressLine1, which has collation SQL_Latin1_General_CP1_CI_AS and has type nvarchar(120). You can create it using code like this:
CREATE TABLE [Person].[Address_Big](
[BigAddressID] [int] IDENTITY(1,1) NOT NULL,
[AddressID] [int] NOT NULL,
[AddressLine1] [nvarchar](60) NOT NULL,
[AddressLine2] [nvarchar](60) NULL,
[City] [nvarchar](30) NOT NULL,
[StateProvinceID] [int] NOT NULL,
[PostalCode] [nvarchar](15) NOT NULL,
[SpatialLocation] [geography] NULL,
[rowguid] [uniqueidentifier] NOT NULL,
[ModifiedDate] [datetime] NOT NULL
select * from Person.Address;
I ran this query quite a few times, and it took about 40 seconds to tell me there were no rows returned.
where AddressLine1 like N'%Farl%'
option (maxdop 1);
Obviously no one lives on 203 Farley Avenue, or 1 Farlabulous Drive. But nor do they live at 711 Gofarles Street. You see, despite the fact that I had specified ‘Farl’ with a capital F and lower-case ‘arl’, it didn’t care about that at all. My collation setting told it explicitly to ignore case. That’s what the CI was for in SQL_Latin1_General_CP1_CI_AS. In fact, if we query select * from fn_helpcollations() where name = 'SQL_Latin1_General_CP1_CI_AS'; we see it says “Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data”. So it’s not only case-insensitive, it’s also kanatype-insensitive and width-insensitive too.
Clearly there is a lot more work for it to do, when scanning large amounts of text looking for a group of consecutive characters that could match inexactly.
Now, without changing my table at all, I changed my query like so, telling it to use a binary collation for the search. To search exactly rather than inexactly.
where AddressLine1 like N'%Farl%' collate Latin1_General_BIN
option (maxdop 1);
I could’ve used the SQL collation SQL_Latin1_General_CP437_BIN, but I find it easier to remember the Windows collation, and in Australia, the default settings for SQL are to use Windows collations. They’re a little better than the SQL collations .
But anyway – this query returned in just 7 seconds. I re-ran the original one – 40 seconds. I re-ran this one – 7 seconds. It really was significantly faster. The plan is the same. There is no sneakiness going on. The search for the binary text was simply faster.
This makes sense. If I’m looking for a particular string, it’s going to be quicker if I can just look for the exact bits, and not have to consider what the text might be in a different case, or if width needs to play a part, and so on.
Now you might think “Great – I’m going to add that to all my string searches”, but you should understand that there is potential for the results to be different. If there were someone in my table who lived in FARLEY BOULEVARD (in all caps, in the way that French people often write their surnames, for example), then that would have been found in my case-insensitive-collation search, but not in my binary-collation search for the lower-case letters ‘arl’. It’s useful if the data in your system is only stored in capitals, in which case (ha!) you could actually change the collation of your column, but it’s definitely worth considering the benefits of asking for a collation-specific search.
And what about grouping, you ask? Ok, maybe I didn’t hear you ask that, but let’s pretend you did.
If there’s an index on the column you’re grouping, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the column is like throwing it away the index order (ORDER BY doesn’t get handled well by changing the collation) means a Hash is required. But comparing two query plans that both use Hash Match (Aggregate), one on a case-insensitive collation and one on a binary collation, then I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better. One would run in about 12 seconds, and one in about 10.
select City, count(*)
group by City
option (maxdop 1);
select City collate Latin1_General_BIN, count(*)
group by City collate Latin1_General_BIN
option (maxdop 1);
Considering what’s going on with a hash function and non-exact strings is actually pretty interesting. HASH(Value) must produce the same value for any two values that are considered equal – such as ‘FARLEY’ and ‘Farley’ in my CI collation. For this to happen, it obviously can’t hash the actual values, it must have to convert the values into a common form that will hash the same way regardless of case, kana, and width. But this is work that is hidden from the query plan. We can see the impact of it through the query speed, but not anywhere in the plan. This will become yet another thing for me to investigate – but not this week before T-SQL Tuesday comes around and I need to publish this post. New father Bob Pusateri (@sqlbob) is hosting this month, about text searching, in case you hadn’t guessed.