After weeks of putting it off, I finally found the time and spent the last day and a half judging the Grouped String Concatenation Challenge. I would like to congratulate the winner, Peter Larsson, who submitted a great query and walks away with a shiny new MSDN Premium subscription.
For those who are interested, following is a breakdown of the judging process, along with some commentary:
To begin with, e-mails. As I mentioned in the first post, I ignored all e-mails that didn't follow the directions. Luckily this was only a few submissions. I found it rather odd that people would spend a not insignificant amount of time working up a solution, yet not bother to read the guidelines thoroughly. But that's human nature, I suppose.
Once I collected all of the queries that followed the e-mail rules (all of which are included in the attached ZIP file), I began testing against an expanded version of AdventureWorks (the script for that is also included). I decided to eliminate any queries that did not produce the correct output data based on my sample set, or which took longer than 30 seconds to complete. The majority of queries did complete in a reasonable amount of time, and many were eliminated because the output simply wasn't correct. The biggest issue was ordering of the elements in the comma-delimited sets. I also deducted points from one person's entry because of invalid column names, but I decided to let the entry ride to the next round.
An important side note is that I created this competition with the sole intention of discovering new and different ways to do grouped string concatenation, and my hope was that someone would come up with a clever, fast solution. Unfortunately, that didn't happen, and every submission that used any technique except FOR XML PATH was eliminated in the first round of testing. I received some extremely creative solutions from a couple of people and I would like to mention them here:
- Alejandro Mesa's submissions made use of various XQuery techniques, and are very interesting to look at, although fairly slow
- Dean Cochrane's submission used an interesting idea of doing a MAX(CASE ...) pivot for the lists. Alas, the product names lists were not correct, so the submission didn't make it to the stress testing phase
- Scott Coleman tried a similar technique, actually using the PIVOT keyword. Unfortunately, this ran for over 200 seconds, so it was eliminated
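For readers who haven't seen it, the FOR XML PATH technique that survived Round 1 generally takes the following shape. This is only a sketch against the AdventureWorks schema, not any contestant's actual entry, and the grouping columns are illustrative:

```sql
-- Grouped string concatenation via FOR XML PATH: for each customer,
-- build a comma-delimited, ordered list of product names purchased.
SELECT
    c.CustomerID,
    STUFF
    (
        (
            SELECT ',' + p.Name
            FROM Sales.SalesOrderHeader AS soh
            JOIN Sales.SalesOrderDetail AS sod ON
                sod.SalesOrderID = soh.SalesOrderID
            JOIN Production.Product AS p ON
                p.ProductID = sod.ProductID
            WHERE soh.CustomerID = c.CustomerID
            ORDER BY p.Name          -- the ordering that tripped up so many entries
            FOR XML PATH('')
        ),
        1, 1, ''                     -- strip the leading comma
    ) AS ProductNames
FROM Sales.Customer AS c;
```

Note that the inner ORDER BY is what guarantees the element ordering that eliminated so many entries in Round 1. One caveat: this form will XML-entitize characters such as ampersands; adding `, TYPE` after `FOR XML PATH('')` and reading the result back with `.value('.', 'VARCHAR(MAX)')` avoids that.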
Also interesting to note is that a few people tried recursive CTE solutions. These were all cancelled at the 300 second mark. Recursive CTEs, as mentioned before here on SQLblog, simply do not scale in their current implementation.
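For those unfamiliar with the pattern, the recursive CTE approach to concatenation looks something like the following sketch. The table `dbo.CustomerProducts` is a made-up, pre-joined table for illustration, not part of the challenge schema:

```sql
WITH Ranked AS
(
    -- Number each product name within its customer group
    SELECT
        CustomerID,
        ProductName,
        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY ProductName) AS rn
    FROM dbo.CustomerProducts
),
Concat AS
(
    -- Anchor: start each list with the first name
    SELECT CustomerID, CAST(ProductName AS VARCHAR(MAX)) AS NameList, rn
    FROM Ranked
    WHERE rn = 1

    UNION ALL

    -- Recursive step: append one name per iteration
    SELECT r.CustomerID, c.NameList + ',' + r.ProductName, r.rn
    FROM Concat AS c
    JOIN Ranked AS r ON
        r.CustomerID = c.CustomerID
        AND r.rn = c.rn + 1
)
SELECT CustomerID, MAX(NameList) AS NameList  -- the longest list per group is the full list
FROM Concat
GROUP BY CustomerID;
```

The scaling problem is visible in the structure: each recursion step appends exactly one element, so the query processor does row-by-row work rather than a set-based pass, and intermediate rows pile up for every partial list along the way.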
After tabulating the Round 1 results I was left with 18 queries, and some obvious contenders. I ran each query through a SQLQueryStress session with 10 threads running 5 iterations each. In this phase the queries were separated into fairly distinct groups: Those that ran for around 5 minutes, those that ran for around 7-8 minutes, and those that ran longer. These groups were based, not surprisingly, on how much attention was paid by the query writers to the little details. For example, Peter Larsson's winning query cut down on logical reads dramatically by doing some of the grouping in a derived table, rather than in the outermost query as some of the other submissions did.
Lesson learned: When doing aggregations, especially when joining a lot of tables, think about what you're really aggregating, and do the aggregation as early as possible. For example, if you need to aggregate sales per customer and get customer names, do the aggregation of the sales numbers first, then join out to get the customer names. Otherwise the query processor is forced to do more work than it has to do, and your query won't be as fast. Peter and a few other contestants understood this distinction and wrote queries that were much faster as a result.
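The difference can be sketched as follows. The tables `dbo.Sales` and `dbo.Customers` are made up for illustration, not taken from the challenge schema:

```sql
-- Slower shape: join everything first, then aggregate in the outermost query.
-- The aggregate has to chew through every joined row, including the
-- repeated customer-name values.
SELECT c.CustomerID, c.CustomerName, SUM(s.Amount) AS TotalSales
FROM dbo.Sales AS s
JOIN dbo.Customers AS c ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerID, c.CustomerName;

-- Faster shape: aggregate in a derived table first, then join out
-- for the names. The join now sees one row per customer.
SELECT c.CustomerID, c.CustomerName, s.TotalSales
FROM
(
    SELECT CustomerID, SUM(Amount) AS TotalSales
    FROM dbo.Sales
    GROUP BY CustomerID
) AS s
JOIN dbo.Customers AS c ON c.CustomerID = s.CustomerID;
```

Both queries return the same result, but the second aggregates over far fewer columns and rows before the join, which is exactly the kind of logical-read savings that separated the winning query from the pack.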
Round 2 eliminated 4 queries, leaving me with 14 to judge based on query style. In order to judge consistently, I came up with 10 factors. A query was allotted 500 points to start, and failure to meet each factor resulted in a 50 point penalty. These factors were:
- Consistent Indentation
- Does the query use the same rules for indentation in all parts? This is huge for readability and helps people understand where each section of the query starts and ends.
- Consistent Capitalization
- Does the query use the same rules for capitalization throughout? For example, keywords should be either all capitalized, or all lowercase.
- Capitalize Keywords
- I like to see keywords capitalized.
- Use AS for Alias Names
- AS is optional, and I've left it out in many queries I've written. But the more of other people's code I read, the more I realize that it really does help on the readability front. Use it. Always.
- Follow Capitalization of Base Tables/Columns
- If the base table is called OrderHeader, I want to see it used as OrderHeader when referenced in your query, rather than orderheader. A trainer I know found this out the hard way, when he reinstalled SQL Server on his laptop shortly before a training session, and used a case-sensitive collation rather than his previously-installed case-insensitive collation. He had been careless in adhering to capitalization for his training materials, and discovered the issue in front of the class. Oops.
- Long Horizontal Lists
- I don't like horizontal scrolling, and I find long lists difficult to read.
- Consistent Vertical Lists
- Put either a comma after each element or before each element, not both. Indent your lists the same way throughout. If you indent some items below the SELECT, don't put other items on the same line as the SELECT (or GROUP BY, or ORDER BY, etc.).
- Alignment of Delimiters
- I follow a .NET-inspired style where I put delimiters on their own lines, and line them up vertically. This gives my code what I feel is an airy, easy-to-read feel. When reading others' code I look for some kind of alignment. Failure to align delimiters makes it very difficult to understand, again, where one section begins and another ends. By the way, common delimiters for this challenge included both parens and CASE...END.
- Comments
- Does the query have comments? Are the comments useful in understanding the logic?
- Overall Impression
- This is perhaps the most subjective factor: my general feeling on how much I enjoyed reading the code.
All in all, the queries were pretty good. I would like to call out Rick Halliday, who had the highest score in this round with some very well formatted and highly readable code.
After judging Round 3 I tallied all of the scores and was left with a three-way tie at the top:
- Rick Halliday
- Leonid Koyfman
- Peter Larsson's query #4
All three of these queries were well thought out, but only one could win, so I took another pass through each. Rick's query, though extremely well written and readable, was eliminated first because it performed worse than the other two. This left Leonid and Peter. It was a tough choice, but I had to give the prize to Peter for taking the time to really think through the problem and figure out exactly how best to do the aggregations. Leonid was a very, very close second, and I really wish I had a consolation prize for him.
And that's that. Thank you to everyone who participated in the challenge. I hope it was as much a learning experience for you as it was for me. Congratulations again to Peter. All of the materials are attached in the ZIP file; please let me know if you have any questions, comments, etc.