THE SQL Server Blog Spot on the Web

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Tokenize UDF

Yes, another string splitting UDF from a guy who's obviously become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call "Tokenization."

So what is Tokenization? It's a word I made up for this problem.

But what is it, really? It's splitting up a string based on a delimiter -- in this case, a comma -- but being wary of substring delimiters. In this case, that's a pair of apostrophes, because that's what TSQL uses for strings.

I think this is best illustrated with an example string:

 

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

A basic split string function, of the kind you can find almost anywhere, will produce the following output:

 

SELECT * 
FROM dbo.SplitString(@Tokens, ',')

OutParam
-------------
a
'b'
''c'
'd'
'e''
f
'1
2
3
4'

Well, that's wrong. Because what I want to do is maintain the substrings (or, "tokens," as I like to call them -- thus, Tokenization!)
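
For reference, here is a minimal sketch of the kind of basic splitter that would produce the ten rows above. The post never shows dbo.SplitString, so everything here is an assumption: the parameter names, the OutParam column alias (taken from the output header above), and a dbo.Numbers table with an INT column called Number.

```sql
-- Hypothetical basic splitter; NOT the author's dbo.SplitString, which the post doesn't show.
-- Assumes a dbo.Numbers table (column: Number) holding at least 1..LEN(@List).
CREATE FUNCTION dbo.SplitString
(
    @List NVARCHAR(2000),
    @Delimiter NCHAR(1)
)
RETURNS TABLE
AS
RETURN
(
    SELECT
        -- Each element runs from position n up to the next delimiter
        LTRIM(RTRIM(SUBSTRING(
            @List,
            n.Number,
            CHARINDEX(@Delimiter, @List + @Delimiter, n.Number) - n.Number
        ))) AS OutParam
    FROM dbo.Numbers n
    WHERE n.Number <= LEN(@List)
        -- An element starts at position 1 or right after a delimiter
        AND SUBSTRING(@Delimiter + @List, n.Number, 1) = @Delimiter
)
```

A splitter like this treats every comma as a boundary, which is exactly why '1,2,3,4' gets chopped into four rows. Note that without an ORDER BY on the outer query, the row order isn't guaranteed.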

The output I desire is:

 

Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

Notice that substrings -- delimited with apostrophes -- should be maintained.

And here's how I've solved this problem...

 

CREATE FUNCTION dbo.Tokenize
(
    @Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE
(
    TokenNum INT IDENTITY(1,1),
    Token NVARCHAR(2000)
)
AS
BEGIN
    DECLARE @i INT SET @i = 0
    DECLARE @StartChar INT SET @StartChar = 1
    DECLARE @Quote INT SET @Quote = 0

    DECLARE @Chars TABLE
    (
        CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        TheChar NCHAR(1), -- NCHAR rather than CHAR, to match the NVARCHAR input
        TheCount INT,
        StartChar INT
    )

    -- Pad the input with a leading and trailing delimiter
    SET @Input = ' , ' + @Input + ' , '

    -- One row per character of the padded input
    INSERT @Chars (TheChar)
    SELECT SUBSTRING(@Input, n.Number, 1)
    FROM Numbers n
    WHERE n.Number > 0
        AND n.Number <= LEN(@Input)
    ORDER BY n.Number

    -- Walk the characters, looking one character ahead (Chars1).
    -- TheCount drops to 0 on the character just before a comma that falls
    -- outside a quoted substring (an even number of apostrophes so far).
    -- @Quote counts apostrophes and resets at each token boundary.
    -- StartChar carries the start position of the current token.
    UPDATE Chars SET
        @i = Chars.TheCount =
            CASE
                WHEN Chars1.TheChar = ','
                    AND @Quote % 2 = 0 THEN 0
                ELSE @i + 1
            END,
        @Quote =
            CASE
                WHEN Chars1.TheChar = '''' THEN @Quote + 1
                WHEN @i = 0 THEN 0
                ELSE @Quote
            END,
        @StartChar = Chars.StartChar =
            CASE
                WHEN @i = 1 THEN Chars1.CharNum - 1
                WHEN @i = 0 THEN @StartChar + 1
                ELSE @StartChar
            END
    FROM @Chars Chars
    JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1

    -- Rows with TheCount = 0 mark the end of a token; the UNION ALL branch
    -- picks up whatever follows the last such row.
    INSERT @Tokens(Token)
    SELECT
        RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
    FROM (
        SELECT StartChar, CharNum
        FROM @Chars
        WHERE TheCount = 0

        UNION ALL

        SELECT
            MAX
            (
                CASE TheCount
                    WHEN 0 THEN CharNum
                    ELSE 0
                END
            ) + 1,
            MAX(CharNum)
        FROM @Chars
    ) x
    -- Discard empty strings and bare commas
    WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT IN ('', ',')
    ORDER BY x.StartChar
    RETURN
END

A word of warning: This UDF uses the undocumented -- and unsupported -- "aggregate update" functionality. I've tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you to not use it in your own projects without extensive testing! MS doesn't support this one, so handle with care.
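
If you haven't seen this trick before, here is a minimal sketch of it in isolation, computing a running total. The table variable and its column names are made up for illustration. The @variable = column = expression form is itself part of the documented UPDATE syntax; what nothing guarantees is the order in which rows are visited, and that order is what both this sketch and the UDF rely on.

```sql
-- Hypothetical demo table; names are for illustration only
DECLARE @t TABLE
(
    Id INT NOT NULL PRIMARY KEY,
    Amount INT NOT NULL,
    RunningTotal INT NULL
)

INSERT @t (Id, Amount) VALUES (1, 10)
INSERT @t (Id, Amount) VALUES (2, 20)
INSERT @t (Id, Amount) VALUES (3, 30)

DECLARE @Total INT
SET @Total = 0

-- Each row reads the variable, adds its own Amount, and writes the result
-- to both the variable and the column. Row order is NOT guaranteed.
UPDATE @t
SET @Total = RunningTotal = @Total + Amount

SELECT Id, Amount, RunningTotal
FROM @t
ORDER BY Id
```

If the rows happen to be visited in Id order, RunningTotal comes out as 10, 30, 60. If they aren't, you get something else, and that's the whole risk.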

And by the way, you need a numbers table to use this thing. Of course.
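
In case you don't have one lying around, here is one way to build it. This is a sketch under assumptions: the UDF only tells us the table is called Numbers and has a column called Number, so the dbo schema, the primary key, and the row count of 10,000 are my choices here. The function only reads rows up to LEN(@Input), which can't exceed 2000 for an NVARCHAR(2000) value, so any table covering at least 1 through 2000 will do.

```sql
-- Assumed shape: a single INT column named Number, matching "FROM Numbers n" in the UDF
CREATE TABLE dbo.Numbers
(
    Number INT NOT NULL PRIMARY KEY
)

-- Fill with 1..10000; the cross join is just a convenient source of enough rows
INSERT dbo.Numbers (Number)
SELECT TOP (10000)
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
```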

As for using this thing, it's pretty easy:

 

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

SELECT Token
FROM dbo.Tokenize(@Tokens)


Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

... and it even appears to work properly!
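
The function also returns a TokenNum column, which the query above doesn't use. A small sketch of how it might come in handy, assuming the numbering follows the left-to-right order shown in the output above:

```sql
DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

-- Ask for the tokens in their original order explicitly,
-- rather than relying on whatever order the rows come back in
SELECT TokenNum, Token
FROM dbo.Tokenize(@Tokens)
ORDER BY TokenNum

-- Pick out a single token by position
SELECT Token
FROM dbo.Tokenize(@Tokens)
WHERE TokenNum = 5
```

If the numbering matches the output above, the second query should return '1,2,3,4'.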

Enjoy... an application for this and the other strange things I've been posting recently is coming very, very soon.


Published Wednesday, July 12, 2006 10:34 PM by Adam Machanic

Comments

 

Jeremy Swartwood said:

Thank you for this.  To note, if a token contains nothing but a space, your script excludes this extra "token".  In my situation I needed to always compare a specific token number, so I needed this empty token.

These changes are not efficient, but they worked.

I changed the INSERT section to use a CASE expression instead: if the trimmed substring equals '', it returns the substring without LTRIM/RTRIM; otherwise it returns the trimmed substring.

case when RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) = ''

then SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)

 else RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))

end

Additionally, I had to change the WHERE clause because SQL thinks that '' = ' '.

SUBSTRING(@Input, StartChar, CharNum - StartChar + 1) NOT LIKE ''

AND

RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT LIKE ','

May 29, 2013 4:39 PM


About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.
