THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

T-SQL Challenge: Grouped String Concatenation

It's been quite a while since the LIKE vs ? Puzzle, and I feel like it's time for another one. Response was overwhelming last time, and I'm back with a much tougher puzzle and a much bigger prize. So get ready, because I'm going to really make you stretch your brain and your T-SQL skills for this one.

But first, a bit of background. String concatenation is something I've talked about on this blog before, and it is an incredibly popular topic; my post on the subject has gotten more hits than any other single post I've ever done. TechNet blogger Ward Pond also understands the popularity of the topic, having discussed concatenation at least five times on his blog. And as illustrated in this excellent article by Anith Sen, there are a number of methods available to the intrepid SQL Server explorer; the techniques Ward and I show are just the tip of the iceberg.

And while all of that is great, there is a deep and troublesome hidden problem: Nowhere in my post, nor Anith's article, nor Ward's series, will you find a technique that completely solves the concatenation problem. The FOR XML PATH('') method--by far the most popular SQL Server 2005 "trick" I see repeated over and over in these articles and on forums--is a bit limiting. It doesn't help when we need to "group" our string concatenation, i.e. concatenate strings for a number of key values and return multiple rows in the result. And FOR XML PATH('') also leaves us a bit flat if we need to return aggregated data with the concatenated strings or--even more interesting--concatenate multiple different columns in the same output.

Sure, there are ways to solve this problem, but they usually require temp tables, user-defined functions (CLR or otherwise), tables of numbers, cursors, or other adjunct objects. And while some of these solutions are certainly workable they lack the beauty of a single, self-contained solution. In the interest of solving this problem I recently created a challenge for myself: Figure out how to do "grouped" concatenation using nothing more than a single T-SQL statement. No temp tables. No UDFs. No procedural logic.

But rather than do the work all alone and simply post my solution, I've decided to invite you to join me in the quest. Are you up for it?

Here are the rules of the game:

  • You are to create a single T-SQL statement that concatenates values from the AdventureWorks (note: not AdventureWorks2008) Sales.SalesOrderHeader, Sales.SalesOrderDetail, Production.Product, and Person.Contact tables.
  • The output should have the following columns, in the following order, and no other columns:
    • CustomerID: The customer's CustomerID (this is the unique key in the output)
    • FirstName: The customer's first name
    • LastName: The customer's last name
    • OrderCount: Number of orders placed by the customer
    • TotalDollarAmount: Total dollar amount of all orders placed by the customer (based on the SalesOrderHeader.SubTotal column)
    • TotalProductQuantity: Total number of items purchased by the customer in all orders (based on SalesOrderDetail.OrderQty)
    • OrderNumbers: Comma-delimited list containing the order numbers (SalesOrderHeader.SalesOrderNumber) for each of the orders placed by the customer
      • The numbers within the list should be alphabetized. The list should have neither leading nor trailing commas, and each element in the list should be separated by a single comma with no spaces or other white space beforeor after the comma
    • ProductNames: Comma-delimited list containing the unique names of all products ordered by the customer in all orders
      • The names within the list should be alphabetized. The list should have neither leading nor trailing commas, and each element in the list should be separated by a single comma with no spaces or other white space beforeor after the comma
  • No tables--permanent, temporary, or variable--are to be created. No dynamic SQL is to be used. No user-defined functions, views, or stored procedures are allowed. No variables may be declared. To put it simply, no permanent or temporary objects of any kind, at any scope, are to be explicitly created. No procedural statements of any kind--cursors or control-of-flow--are allowed. This must be a standalone statement in the AdventureWorks database; nothing more and nothing less.
  • Aside from the previous stipulation, any SQL Server 2005 or 2008 feature is fair game. Documented or not, if it ships with the product and can be used in a standalone T-SQL statement, you can use it. If you do use a version-specific feature, please let me know (especially if it's a SQL Server 2005 feature that's gone in 2008). Bonus points may be given for solutions that work on either version, but I'll make that decision after reviewing the submissions.
  • Entries will be judged first and foremost on correctness, then on a combination of performance, readability, and ability to apply your technique as a general pattern.
    • Just to be absolutely clear: If your submission violates the rules, outputs the wrong data, or does not precisely follow the output guidelines listed above, it will be ignored. Last time I spent a lot of energy going back and forth with people helping them get there, and I just don't have the bandwidth to do that again. So double-check your submission before you send it to me.
    • Make your submission readable or you will lose credit even if performance is amazing. I don't appreciate looking at a mess, and it's good for your career to learn how to write code that others can maintain. Hint: Learn to indent your code properly; lack of indentation is the biggest mistake I see people make with regard to readability.
    • Take your time. You have two weeks to work on this. I've already come up with three different solutions that have vastly different performance characteristics. Perhaps your first shot isn't the best choice?
  • The entry deadline is March 16, 2009, midnight GMT. No exceptions.
  • Submissions should be e-mailed to me, using a .SQL file attachment.
    • Do not paste your solution into the body of your e-mail
    • The subject of your e-mail should be "Grouped String Challenge Submission".
    • E-mail your submissions to [my first name] [at] [this site].
    • Again, be careful and don't violate these guidelines or your submission will be ignored.
  • ... What's that? You want a prize? Fine, fine ...
    • The prize, for the best submission, is a full MSDN subscription, valued at around $10,000. How's that for inspiration?

... and that's that! One final note: Please do not post your solution in the comments here or on another blog, before the deadline has been reached! Last time several people did that and it was incredibly annoying both for me and those contestants trying to think through the problem. You won't be doing yourself any favors by trying to mess up the competition.

Once the deadline is reached I will test all of the submissions, tabulate the results, and post back here in early April. I promise, I won't let the thing stagnate for months like I did last time.

Have fun with it, be creative, and feel free to post comments here with any questions you might have. I found this to be a fairly difficult but very interesting exercise and I hope you agree. Enjoy, and I'm looking forward to seeing what you can do!

Published Friday, February 27, 2009 2:22 PM by Adam Machanic

Attachment(s): grouped_string_concatenation_output.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Aaron Alton said:

What a cool challenge.....

Do you consider a Common Table Expression to be a "temporary object"?  It usually hashes out the same, but I find that CTE's can significantly improve readibility.

February 27, 2009 2:26 PM
 

Adam Machanic said:

Hi Aaron,

Great question, and the answer is no -- a CTE is not a temporary object but rather part of a bigger statement.  All of the following (among other things) are most definitely allowed:

CTEs

Derived Tables

Subqueries

I'll even go so far as to say that any relational operator is also allowed, including UNION, UNION ALL, EXCEPT, INTERSECT. If you can't get the whole result in one query, perhaps you can get it in two or more?

As an aside, I would also like to note that you should not rely on any specific data in the target tables. I will be testing on modified (larger) versions of the tables.  Same indexes, but different data -- so no "cute" attempts to send me all of the data in a bunch of UNIONed SELECTs (which would be a single statement according to the rules of the game, but wouldn't pass my correctness test and would be a complete waste of time for everyone involved).

February 27, 2009 2:32 PM
 

Cade Roux said:

Any chance of a known good target output we can use to run tests against?

February 27, 2009 2:42 PM
 

Adam Machanic said:

Hi Cade,

Sure, I'll get something attached to the post shortly.

February 27, 2009 2:50 PM
 

Matija Lah said:

Well, there are commas in Product.Name values.

Are we allowed to use semi-colons to separate individual product names?

(I'm OK with commas, personally, but readability my be an issue.)

:)

ML

February 27, 2009 3:11 PM
 

Adam Machanic said:

Hi Matija,

You're right, there are commas in there, and that was probably a bad choice of delimiter on my part.  But we'll stick with it.  Please preserve any commas and formatting already in the data.  Add only a single additional comma, and no additional white space, in your output.

February 27, 2009 3:15 PM
 

Roy Harvey said:

In the copy of AdventureWorks that I am working with, the column Production.Product.Name has 214 rows with commas already embedded in the data.  (290 rows have no comma.)  Do you expect that data to be handled in any particular way?  Unlike the specifications you provided the embedded commas all appear to be followed by a blank.

February 27, 2009 3:18 PM
 

Adam Machanic said:

Update: Sample output is now attached to this post. If your query can match that output, you're already halfway there. See how easy I'm making this?

February 27, 2009 3:18 PM
 

Roy Harvey said:

Never mind!

February 27, 2009 3:18 PM
 

Cade Roux said:

Looks like we're leaving out customers without order, correct?

February 27, 2009 3:40 PM
 

Peso said:

Your example file for TotalDollarAmount is erraneous.

For Customer ID 1 (Orlando Gee), which has 4 orders (SO43860, SO44501, SO45283 and SO46042), the TotalDollarAmount is

13216.0537 + 23646.0339 + 34066.1881 + 31423.5209 = 102351.7966

Not 1967445.6889

February 27, 2009 3:42 PM
 

Adam Machanic said:

Cade: Yes

Peso: Just making sure you were paying attention <g> ... the file has been updated with correct values.  Thanks for pointing that out!

February 27, 2009 3:55 PM
 

Cade Roux said:

I'm sorry to say I'm not terribly familiar with the AdventureWorks schema.  What's the rule for choosing Store Contacts for a customer of store type?  I've got the "hard stuff" working, but I'm having trouble selecting a First Name and Last Name for "Store" type customers who have multiple contacts, like CustomerID = 2:

FirstName LastName

Keith Harris

Geraldine Spicer

So I'm ending up with multiple rows for some customers.

I find there are some customers with both owners and purchasing managers.  I didn't see any useful views or UDFs in the database.

February 27, 2009 4:07 PM
 

Peso said:

A dumb question.

Where do I send my solution?

February 27, 2009 4:11 PM
 

Adam Machanic said:

Hi Cade,

Just go straight to Person.Contact with the ContactID.

February 27, 2009 4:11 PM
 

Adam Machanic said:

Peso:

[my first name] [at] [this site]

February 27, 2009 4:17 PM
 

Cade Roux said:

You're talking about ContactID in SalesOrderHeader?  It turns out there's no variation within CustomerID:

SELECT CustomerID

FROM Sales.SalesOrderHeader AS SOH

GROUP BY CustomerID

HAVING MIN(ContactID) <> MAX(ContactID)

but that's probably either a bad DB design or a potential assumption loophole.

February 27, 2009 4:20 PM
 

Adam Machanic said:

OK, then there is no problem :-)

Just use the ContactID to get the names, not the CustomerID, and you'll be fine.

February 27, 2009 4:23 PM
 

Peso said:

Just for comparison, how many seconds and how many reads do your best attempt use?

February 27, 2009 4:41 PM
 

Adam Machanic said:

Peso,

Part of the challenge is wondering whether you've found the best solution. If post my best and you do worse, you'll know -- and that's not nearly as fun... Plus, I don't want anyone thinking that my best is necessarily THE best and only trying to go that far. So I will share numbers only at the end.

February 27, 2009 4:54 PM
 

Peso said:

Fair enough.

February 27, 2009 4:57 PM
 

RBArryYoung said:

Hmm, well unless you are intentionally baiting us, you are either really going to like my solution or really, really hate it. :-).

March 1, 2009 2:54 PM
 

Michael Zilberstein said:

You Excel output example is sorted by CustomerID column - it it a requirement?

March 2, 2009 2:34 AM
 

Brian Tkatch said:

March 2, 2009 9:51 AM
 

Adam Machanic said:

Michael: No, there is no sorting requirement.

March 2, 2009 9:58 AM
 

Peso said:

I am looking forward to the presentation of results, together with solutions.

There are always new things to learn!

March 2, 2009 12:29 PM
 

RBarryYoung said:

I thought all of you folks would be at the MVP Summit?

March 2, 2009 10:14 PM
 

daveballantyne said:

Which Adventure works should we be using ?

Ive downloaded AdventureWorksDB.msi from the Codeplex site Brian gave

but ContactID 1 is  "Gustavo Achong" not "Orlando Gee"

March 3, 2009 5:37 AM
 

daveballantyne said:

Also ,  do we have to use the base schema , or are we allowed to add indexes ?

March 3, 2009 7:09 AM
 

Adam Machanic said:

Dave:

No indexes.  Check your query and read all of the comments here; you might be joining incorrectly.

March 3, 2009 10:49 AM
 

Mister Magoo said:

Have I got the wrong data or have you prefixed all order numbers with "SO"?

Thanks

MM

March 8, 2009 3:49 PM
 

Adam Machanic said:

I've not prefixed anything.  The order numbers are already that way.  At least on my end; if yours aren't perhaps you need to reinstall AW before doing this challenge.

SELECT SalesOrderNumber

FROM Sales.SalesOrderHeader

March 8, 2009 5:46 PM
 

Mister Magoo said:

Thanks Adam, I was stupidly using the SalesOrderID...not familiar with AW data..

Anyway my submission is sent - thanks for a fun challenge.

March 8, 2009 8:16 PM
 

Virgil Rucsandescu said:

This is indeed quite a challenge. I am looking forward to the presentation of solutions.

March 11, 2009 1:22 AM
 

Dan Holmes said:

I and seeing that "Gustavo Achong" is contact 1 too.  No join on this one just SELECT ... FROM Person.Contact WHERE ContactID = 1

I wasn't worried about the difference until i read your response.

March 13, 2009 8:29 AM
 

Adam Machanic said:

Just over two weeks ago I posted the Grouped String Concatenation Challenge . A more difficult challenge

March 15, 2009 8:41 PM
 

Alexander Kuznetsov said:

String concatenation in SQL is frequently discussed on newsgroups and blogs. Adam Machanic recently posted

March 29, 2009 2:39 PM
 

Peso said:

Out of curiosity I checked out http://msdn.microsoft.com/en-us/subscriptions/renew.aspx and found the only MSDN subscription matching the prize sum was "Visual Studio Team System 2008 Team Suite with MSDN Premium".

It's fantastic!

April 8, 2009 3:12 AM
 

GATTU KUMAR said:

Respected Sir,

  Sir presently i am working in onsite as junior database developer.I have been facing problem with Strings Concantenation.Sir i did one example

in that i am getting error.plese solve this example.

Example:

      Declare @a as varchar(1000)

  set @a='select e.empno,e.firstname'+' ' +'e.lastname as ename,e.salary from emp e'

      exec (@a)

Output:incorrect syntax near '.' iam unable to solve this problem

   plese correct this error.and send url to me the topic about strings Concantenation hopes u to send replay to me

May 10, 2009 10:48 PM
 

Adam Machanic said:

After weeks of putting it off, I finally found the time and spent the last day and a half judging the

May 31, 2009 4:58 PM
 

JAY said:

WHERE CAN I SEE THE ENTRERED / WINNER'S SOLUTIONS???

December 3, 2009 2:15 PM
 

Owais Ahmed said:

Very easy challenge for me. Did you not have tough challenges?

April 6, 2011 12:20 PM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement