THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

The Grouped String Concatenation Challenge is Closed

Just over two weeks ago I posted the Grouped String Concatenation Challenge. A more difficult challenge than the last one I posted, partially in hopes that not as many people would submit solutions and it would be easier for me to judge. But alas, my readers are obviously really into this stuff, and I received submissions from 39 people, some of whom sent me up to four different solutions. I didn't expect such an amazing response, and now it's up to me to sort through all of these e-mails and declare a winner--no easy task.

The submissions I've looked at so far have been of very high quality, and most of them use FOR XML PATH('') in some way. A few people experimented with other techniques--which is exactly what I was hoping for--and I'll be especially interested in seeing how their solutions stack up. I also spent a lot of time this last two weeks working on alternative solutions, and most of my attempts didn't scale well.  I'm hoping that some reader managed to come up with a trick I haven't thought of yet.

The judging will be done as follows:

  1. I'll run each submission and test it for correctness against my own benchmark query (the results of which were attached to the original post). If any of the rows don't match exactly, that submission will be disqualified. 
  2. Next I will run each query twice, and will eliminate any query that doesn't run in less than four seconds on my notebook (the fastest ones I've already seen run in two).
  3. The remaining queries will be run through SQLQueryStress sessions on a version of AdventureWorks modified to have twice as many customers and eight times as many orders. Queries in this stage will be scored on a bell curve where the top 15% will receive a score of 10, the upper-middle 35% a score of 6, the lower-middle 35% a score of 2, and the rest a score of 0.  The 0s will be eliminated.
  4. The remaining queries will then be evaluated for readability. This is purely subjective and since I'm the judge my idea of what constitutes readability wins. So I hope you've been reading my blog for a while and have absorbed some of my best practices in this area! Queries will be scored based on the same curve as before.
  5. At this stage the queries with the highest combined scores will be weighted based on innovation and the ability to apply their techniques to a general pattern, in order to find the number one query.

Thanks again to everyone who submitted! I'm looking forward to judging the submissions. Watch this space for a comprehensive breakdown of the results, to be posted sometime in the next couple of weeks.

Published Sunday, March 15, 2009 9:22 PM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Dean said:

I thought the deadline was midnight, Mar 16? My entry is almost ready to be emailed.

March 15, 2009 9:14 PM
 

Adam Machanic said:

Midnight, GMT.  It's now 1:35 a.m.

March 15, 2009 9:36 PM
 

Peso said:

Maybe a culture issue?

For example, when we in Sweden say midnight the 16th of March, we mean 24:00 (23:59:59 + 1 second).

I believe Adam meant 00:00:00 the 16th of March (12:00 AM).

March 16, 2009 3:23 AM
 

MontyMole said:

Are they going to be tested to check that they work correctly with data with an & in them (the sample data doesn't have any) or do they just have to work with the sample data?

March 16, 2009 8:19 AM
 

Virgil Rucsandescu said:

Probably Adam wrote a script to compare each result set with the required result, so probably  each solution sent by us will be run against the data in AdventureWorks and if the result doesn't fit exactly with the required data set, then it's out of competition ...

March 16, 2009 9:39 AM
 

Adam Machanic said:

No, I won't be checking for entitization problems, but perhaps I'll give a bonus to those that did include workarounds in their solutions (very few).

March 16, 2009 9:39 AM
 

Adam Machanic said:

Alright, here's the deal: Three people have now e-mailed me saying they misunderstood the meaning of "midnight", and even though I think it's pretty clear I'm going to be nice about it.  I'll allow entries through 11:59 p.m. GMT today. But here's the catch: I am going to judge these late entries a bit harder than the rest, because you've had more time and, perhaps, the chance to read about how I'll judge the competition. So anyone submitting late had better send me something really, really good if you want to be a competitor!

March 16, 2009 12:48 PM
 

virgilrucsandescu said:

I made my query with ~ 27% faster, but I suppose it is not honest to re-send it now ... Hopefully the fastest query in this challenge will be even faster than what I have now ;-)

March 16, 2009 6:33 PM
 

RBArryYoung said:

"Next I will run each query twice, and will eliminate any query that doesn't run in less than four seconds on my notebook"

Umm, is it OK to ask about your Notebook's configuration?  That'll tell me real fast if I'm in or out...  :)

March 16, 2009 7:14 PM
 

Peso said:

It's a good thing I didn't send my suggestion that only required 60k reads for the entire solution!

It runs for 30 seconds... :-)

March 18, 2009 6:39 AM
 

Virgil Rucsandescu said:

Strange... I think I have more reads even in "SET STATISTICS IO" info (my profiler is crazy - it shows ~ 100 times more reads than the stats info) and my query runs in ~ 3 sec (elapsed time). But my desktop is a quad core, I suppose it will need more on a notebook - so I am probably out of the competition. I am really curious if there is a non - FOR XML PATH('') solution that is very fast ...

March 18, 2009 11:32 AM
 

RBArryYoung said:

I fell compelled to point out that comparing entitized XML solutions to corrected (non-entitized) solutions is not very fair since fixning the entitization problems adds almost 30% CPU overhead.

March 18, 2009 12:31 PM
 

Adam Machanic said:

Well it's not like I asked anyone to fix that problem in the code, and I pointed out in the original post that performance should be the main concern.  A lot of people, though, insist upon putting their own spin on things, for better or for worse.  For example, one person sent me a solution where he used ROUND to take all of the numbers to two digit scale.  I don't recall asking for that, and his solution was immediately kicked out for not matching the posted sample data.  I just don't get peoples' thought processes sometimes...

Your submission, which did deal with entitization (as I'm sure you're well aware <g>), made it through the first round.  We'll see whether it scales under load.  It will be an interesting test...

March 18, 2009 1:03 PM
 

Peso said:

To keep the suspense, will you post the number of queries passing each test?

March 18, 2009 3:23 PM
 

Peso said:

SET STATISTICS IO has a bug when MERGE JOIN eller HASH JOIN is used.

See this connect issue

https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=387326&wa=wsignin1.0

Using SQL Profiler however displays the total number of reads.

March 18, 2009 3:27 PM
 

George Palacean said:

Adam, the original post states that "Entries will be judged first and foremost on correctness, then on a combination of performance, readability, and ability to apply your technique as a general pattern."

So it's not performance the main concern, but correctness (here I would include de-entitizing XML special characters, otherwise the output is simply incorrect)

March 19, 2009 8:37 AM
 

Adam Machanic said:

George: The only thing I'm going to look at with regard to correctness is whether the output meets the output that my own query produces. I will not eliminate any queries that fail to deal with entitization, because the data is one of the givens of the contest and the data has no such issues.

That said, I don't think it will matter anyway from a performance point of view.  I just ran some tests on this end and despite the fact that STATISTICS TIME is reporting slightly higher CPU utilization for FOR XML PATH queries that properly deal with the entitization, I'm seeing no difference in terms of performance (testing using 100-thread, 100-iteration SQLQueryStress runs).

March 19, 2009 1:43 PM
 

Roman said:

Well, it seems strange - even with [comment()] trick (which is fast but hardly general solution) I find not so small overhead, forget about ".value".

I'm realy looking forward to see the best solution. I'v already  made 10% faster query myself :( - pity that I spotted your challange so late and had only 2 hours.

March 19, 2009 7:41 PM
 

Linchi Shea said:

Adam;

I'd suggest that while you continue your evaluation and your process of picking a best solution, simply post all the sulotions and let the folks see what solutions have been produced and learn from others. Performance is but one of many dimensions, and people can learn a heck a lot just by looking at what angels the problem is being appraoched.

I don't believe you said anything about the folks not posting their solutions after the deadline. So the other alternative is for people to post their own solutions here, if they so choose to (and in case they don't want to post their solutions for whatever reason).

I for one am interested in seeing all the other solutions.

March 19, 2009 10:49 PM
 

Adam Machanic said:

Linchi: Fair enough; I will package up all of the solutions I received (unless someone e-mails me and asks me not to), and post it along with the results.

Roman: You just taught me a new one.  I had no clue about that "comment" trick.  That is fantastic, and I'm going to go modify my Who is Active? script to use it instead of FOR XML PATH for the query output.  Thanks for the tip!

March 20, 2009 10:15 AM
 

Adam Machanic said:

... or not -- it seems that the comment() trick has a major limitation:

select ' --- ' as [comment()]

for xml path('')

Result:

Msg 9322, Level 16, State 2, Line 3

Two consecutive '-' can only appear in a comment constructor if they are used to close the comment ('-->').

March 20, 2009 10:52 AM
 

Roman said:

as I said it's hardly general - there are some limitations, but:

Replace('<!-- -- this can help -- -->','-',Char(150))

Replace('<!-- -- or this -- -->','--',Char(150)+Char(150))

(reverse back if this change is a problem)

probably this leads to a little ugly looking code, but still could be faster then .value

March 20, 2009 12:04 PM
 

Adam Machanic said:

Might be faster than .value, but REPLACE can be very slow for big strings so that's very much an open question.  Alas, not at all useful for my needs with my Who is Active? script, which is takes advantage of the fact that SSMS lets you click on XML values to pop them up nicely formatted in a new tab.  Entitization definitely bites me there and I was hoping that this comment() thing will work, but obviously most SQL has some form of comments so that is not going to be usable...

March 20, 2009 1:32 PM
 

Roman said:

stupid me: [processing-instruction(a)] is much better then [comment()]; of course 2xreplace have still to be made

in this particular case for sample data .value is 50% slower

March 20, 2009 4:19 PM
 

Adam Machanic said:

Wow, that looks VERY promising!  But how did you find that?  Is there a list somewhere of different commands that can be used in square brackets with FOR XML?

March 20, 2009 4:36 PM
 

Roman said:

BOL :); although at first I tried [cdata()] by intuition - it didn't work - so I've checked the documentation.

Anyway - I have quite diffrent result using your test script on my dev maching (i7)

.value - 375ms; [processing-instruction(a)] 1179 - so on my i7 it is 3x slower, not faster (from SET

Previous result was with my challage script (a little modfifed) with your SQLQueryStress (this tool looks promising :) btw; found 1 problem: allowed query length seems to be too low - couldn't accomodate query with use plan option) 10 iterations, 100 threads

March 20, 2009 5:35 PM
 

Adam Machanic said:

Ah, notice that I took down the script to play with it some more :-)

I found that converting to VARCHAR(MAX), rather than relying on the implicit conversion to NVARCHAR(MAX), makes processing-instruction() much faster (which makes perfect sense -- less data to do the REPLACEs on).

Regarding SQLQueryStress, I highly recommend a stored procedure for that case.  You could, if you don't want to create a permanent stored procedure, create a temporary global stored procedure first and test against that.  Sending all of that data to SQL Server on every iteration is going to stress more than just the query itself -- it's going to stress the network, the query parser, etc.

March 20, 2009 5:51 PM
 

Roman said:

Thanks for the tip - you are of course right - it wouldn't make sense to test that way - I should have realized that earlier.

One observation about your taken down script: datalength takes much longer with /for xml path('')/ then with /for xml path(''),type/. 882ms - 369 ms;

besides for me execution plan shows costs wich are in contradiction to execution times

March 20, 2009 6:31 PM
 

Roman Nowak said:

50% faster was not enough for me ;)

SQLQueryStress tool (10 iteration, 100 threads) (min:sec)

v1 (ent off) 4:41

v1 (ent on) 9:50* (.value method)

v1a (ent off) 4:11

v1a (ent on) 4:32* (pi method)

from the very begining I was ure I can do better - had to prove it to myself.

besides it was fun and I've learned sth.

(after some limited test it seems pi is faster than .value - depending on the string size from >3x to 5% - there are some limitation but not that big)

March 24, 2009 5:59 AM
 

Adam Machanic : Who is Active? v8.40 - Now With Delta Power! said:

March 30, 2009 12:53 PM
 

Peso said:

Deadline has occured, and while Adam finishes his tests I decided to publish my suggestion here

http://weblogs.sqlteam.com/peterl/archive/2009/02/28/Competition.aspx

April 1, 2009 4:16 AM
 

mjswart said:

(april 22) Any word on the results? I'm very curious :-)

April 22, 2009 11:30 AM
 

mjswart said:

(may 21) still curious :-)

May 21, 2009 1:19 PM
 

Henrik Sjang Davidsen said:

I'm quite curious as well

May 28, 2009 2:53 PM
 

Adam Machanic said:

After weeks of putting it off, I finally found the time and spent the last day and a half judging the

May 31, 2009 4:58 PM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement