<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://www2.sqlblog.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Adam Machanic : contest, join algorithms</title><link>http://www2.sqlblog.com/blogs/adam_machanic/archive/tags/contest/join+algorithms/default.aspx</link><description>Tags: contest, join algorithms</description><dc:language>en</dc:language><generator>CommunityServer 2.1 SP2 (Build: 61129.1)</generator><item><title>Solution for the "LIKE vs. ?" Puzzle</title><link>http://www2.sqlblog.com/blogs/adam_machanic/archive/2008/10/01/solution-for-the-like-vs-puzzle.aspx</link><pubDate>Wed, 01 Oct 2008 19:37:00 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:9164</guid><dc:creator>Adam Machanic</dc:creator><slash:comments>9</slash:comments><comments>http://www2.sqlblog.com/blogs/adam_machanic/comments/9164.aspx</comments><wfw:commentRss>http://www2.sqlblog.com/blogs/adam_machanic/commentrss.aspx?PostID=9164</wfw:commentRss><description>&lt;p&gt;In late April, I posted a &lt;a href="http://sqlblog.com/blogs/adam_machanic/archive/2008/04/22/sql-server-query-processing-puzzle-like-vs.aspx"&gt;puzzle to test readers' knowledge of SQL Server query processing internals&lt;/a&gt;.
The goal of the puzzle was to take a simple yet incredibly inefficient
query and rewrite it, without changing the base tables or adding any
additional database objects, to improve performance.&lt;/p&gt;&lt;p&gt;The deadline
for entries was May 1. I expected to receive a few responses, sort
through them, and get a solution post published within the week. But
alas, that's not what happened. I received hundreds of submissions,
many of them really, really good. I started sorting them, and during
that process I spoke at a &lt;a href="http://sqlblog.com/blogs/adam_machanic/archive/2008/04/24/ot-sqlteach-almost-here.aspx"&gt;couple of&lt;/a&gt; &lt;a href="http://sqlblog.com/blogs/adam_machanic/archive/2008/06/05/demos-from-my-teched-session-dat302-best-practices-for-exception-handling-and-defensive-programming-in-microsoft-sql-server-2005-and-2008.aspx"&gt;conferences&lt;/a&gt;, &lt;a href="http://sqlblog.com/blogs/adam_machanic/archive/2008/05/28/when-your-hard-drive-comes-knocking-a-cautionary-tale.aspx"&gt;my hard drive crashed&lt;/a&gt;
and I lost my work (but none of the submissions, luckily), and I was
generally too busy to sort through the hundreds of submissions.&amp;nbsp; Then I
finally got some time and started working on this post on June 24, and
then... Well, I have no further excuses.&amp;nbsp; I just didn't finish.&lt;br&gt;
&lt;/p&gt;&lt;p&gt;So I would like to start with an apology for the extensive
delay. I know how impatient DBAs are, and several readers were kind
enough to e-mail me and &lt;strike&gt;call me out for my laziness&lt;/strike&gt; prod me along all this time.&lt;br&gt;&lt;/p&gt;&lt;p&gt;I am happy to announce that today, finally, I finished sorting and judging the submissions, and I have to say that I
am extremely impressed with everyone who submitted! After eliminating
all but the very best submissions, I still had 20 on my list.&amp;nbsp; Choosing
the top two was extremely difficult; I had to decide between excellence
and excellence. After thinking about it for a while, I decided to
give one award for the best code and a second for the best explanation
(as both of these were required per the rules of the challenge).&amp;nbsp; And I still ended up awarding three people rather than just two.&lt;br&gt;
&lt;/p&gt;&lt;p&gt;Before I get to the solutions, let's take a second (or Nth) look at the original query:&lt;/p&gt;
&lt;pre style="margin-left:40px;"&gt;SELECT *&lt;br&gt;FROM b1&lt;br&gt;JOIN b2 ON&lt;br&gt;    b2.blat2 LIKE b1.blat1 + '%'&lt;/pre&gt;
&lt;p&gt;The difficulty here, from a query processing point of view, is the
nature of the JOIN predicate. The task for the query processor
is to figure out how, given a 'blat2' value, all of the associated
'blat1' values can be determined. The query processor has only three
possible options for satisfying this (or any other) join:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;A merge join. In this case, it's not possible, as SQL Server's
merge algorithm only supports equijoins. In this case, we're joining
based on a range: all 'blat2' strings that start with the same
characters asthose in 'blat1'. &lt;/li&gt;&lt;li&gt;A hash join. Again, not possible, as we need an equijoin. SQL
Server's hash algorithm is not ordered (which is not to say that it &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.5835"&gt;couldn't be in the future&lt;/a&gt;), so hash keys must be compared based on equality.&lt;/li&gt;&lt;li&gt;A loop join. This join type can be used, for better or for worse,
in all cases.&amp;nbsp; And in this case it does get used, for worse.&amp;nbsp; For every
row of table b1, table b2 gets scanned and all of the matching rows are
found. That's a lot of scans, and a lot of work.&lt;/li&gt;&lt;/ol&gt;
&lt;p&gt;There must be a better way. Clearly, the key is to try one of
those other join types, but we need an equijoin!&amp;nbsp; Virtually everyone
who submitted responses figured this out, whether by accident or on
purpose, and almost everyone came up with a query approximating the
following:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;SELECT *&lt;br&gt;
FROM b1&lt;br&gt;
JOIN b2 ON &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; LEFT(b2.blat2,5) = b1.blat1&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This query solves the problem, and solves it well.&amp;nbsp; It turns the
query from a two-minute nightmare to a half-second dream, all via the
magic that is the hash join.&amp;nbsp; But don't take my word for it.&amp;nbsp; Instead,
why don't we let contest winner #1, &lt;b&gt;Chamindu Munasinghe&lt;/b&gt; explain?&amp;nbsp; Following is a slightly edited version of his winning submission.&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;The optimized query uses a hash match to perform the join
operation. When a hash match is performed the table b1 is scanned and a
hash table is built in memory. The hash key is computed using the
b1.blat1 column and the rows from b1 are added to it. Then another
table scan is performed on the b2 table and a hash key is computed
using LEFT(b2.blat2, 5) value and then the previously built hash table
is searched [ed: "probed"] on the key and matches are made.&lt;br&gt;
  &lt;br&gt;
But if you consider the query that uses the LIKE operator, the database
engine cannot perform a hash join because we don't have a value from b2
to compute a hash. Instead it has to do a nested loops join where for
each row in b1 it loops over the b2 table and tries to match the blat1
and blat2 columns using the wild card pattern.&lt;br&gt;
  &lt;br&gt;
If you have N.b1 rows in b1 table and N.b2 rows in the b2 table the
number of operations required to do the join can be computed like this&lt;br&gt;
  &lt;br&gt;
Optimized query = N.b1 + N.b2&amp;nbsp; (Scan both tables)&lt;br&gt;
LIKE query = N.b1 * N.b2 (Nested Loops)&lt;br&gt;
  &lt;br&gt;
Due to that the optimized query is much faster. I have assumed that the
number of rows in b1 can be accommodated in memory and a the hash match
is done in memory. If this is not the case then the computations will
be different because it can use a grace hash match or a recursive hash
match. But the hash match is still faster than the nested loop.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Chimandu pointed out in his submission that he is a .NET developer
-- not a SQL Server professional -- and it's clear in his answer that
it was his knowledge of the inner workings of data structures that
helped him answer the question so well.&lt;/p&gt;
&lt;p&gt;Now that we've looked at the winning explanation, which explains why fastest possible way to solve the puzzle works, I
will reveal the second set of winners -- a tie, actually, between two
creative coders. &lt;/p&gt;
&lt;p&gt;&lt;b&gt;Andrew den Hertog &lt;/b&gt;came up with the same solution as Chimandu
and everyone else, but took it one step further. He was bothered by the
fact that this solution was hardcoded and would only work with an input
of five characters, so he smartly took a stab at making it dynamic:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;SELECT&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; b1.*, &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; b2.*&lt;br&gt;
FROM b1&lt;br&gt;
CROSS JOIN&lt;br&gt;
(&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; SELECT &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; MAX(LEN(blat1)) AS fieldSize &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; FROM b1&lt;br&gt;
) AS fs&lt;br&gt;
JOIN b2 ON &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; LEFT(b2.blat2, fs.fieldsize) = b1.blat1&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This solution is a bit slower than the non-dynamic version, and &lt;i&gt;almost&lt;/i&gt;
works... But alas, it fails to return the same results if you modify
the script that builds the two input tables, and change the number of
characters to, say, 7 rather than 5. The reason this fails is due to
the fact that some of the input values are only 5 characters long --
which is the reason I chose that as the initial length.&lt;br&gt;
&lt;/p&gt;
&lt;p&gt;Around the same time that Andrew came up with his solution, &lt;b&gt;Kevan Riley&lt;/b&gt;
sent me a non-dynamic way of solving the problem generally for any
input character length. The trick is to delimit both of the input
strings: &lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;SELECT * &lt;br&gt;
FROM b1&lt;br&gt;
JOIN b2 ON &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; '#'+b1.blat1+'#' = '#'+LEFT(b2.blat2,7)+'#'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Great job solving that one, Kevan!&amp;nbsp; Combining this solution with
Andrew's, we get the best possible solution, totally dynamic and viable
for all possible input cases:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;SELECT&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; b1.*, &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; b2.*&lt;br&gt;
FROM b1&lt;br&gt;
CROSS JOIN&lt;br&gt;
(&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; SELECT &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; MAX(LEN(blat1)) AS fieldSize &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; FROM b1&lt;br&gt;
) AS fs&lt;br&gt;
JOIN b2 ON &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; '#'+LEFT(b2.blat2, fs.fieldsize)+'#' = '#'+b1.blat1+'#'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Great job, &lt;b&gt;Chamindu&lt;/b&gt;, &lt;b&gt;Andrew&lt;/b&gt;, and &lt;b&gt;Kevan&lt;/b&gt;!!!&amp;nbsp; I would also like to name four runners-up, each of whom also provided excellent submissions:&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Gordon Klundt&lt;br&gt;
Creighton Kagey&lt;br&gt;
Erik Eckhardt&lt;br&gt;
Greg Bodden &lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Again, this was an extremely difficult choice and I'm sorry I was
not able to award everyone, but if you're on this list please give
yourself several pats on the back for your great work.&lt;/p&gt;
&lt;p&gt;Thanks again to everyone who participated.&amp;nbsp; I was truly impressed by
all of the submissions.&amp;nbsp; This contest was a lot of fun, and I have a
few ideas for some other challenges for the near future. Stay tuned. &lt;br&gt;
&lt;/p&gt;&lt;img src="http://www2.sqlblog.com/aggbug.aspx?PostID=9164" width="1" height="1"&gt;</description><category domain="http://www2.sqlblog.com/blogs/adam_machanic/archive/tags/query+processing/default.aspx">query processing</category><category domain="http://www2.sqlblog.com/blogs/adam_machanic/archive/tags/hash+match/default.aspx">hash match</category><category domain="http://www2.sqlblog.com/blogs/adam_machanic/archive/tags/join+algorithms/default.aspx">join algorithms</category><category domain="http://www2.sqlblog.com/blogs/adam_machanic/archive/tags/contest/default.aspx">contest</category><category domain="http://www2.sqlblog.com/blogs/adam_machanic/archive/tags/string+searching/default.aspx">string searching</category><category domain="http://www2.sqlblog.com/blogs/adam_machanic/archive/tags/nested+loops/default.aspx">nested loops</category></item></channel></rss>