

Faster, More Scalable SQLCLR String Splitting

It seems like every couple of months we see yet another post on SQLCLR string splitting routines. Many bloggers, I suppose, are still struggling, even three years later, to find that "perfect" use case for SQLCLR. Is string splitting it? Probably not. And with SQL Server 2008 table-valued parameters now available, SQLCLR string splitting has become an even less interesting exercise than it was before. Nonetheless, a recent post on the Less Than Dot site has inspired me to counter with my own best SQLCLR string splitting method, for those of you who are still interested in solving this problem.

I've noticed that almost invariably, the methods posted online stress how very easy it is to do string splitting in .NET, thanks to the String.Split method. And while this easy method tends to work pretty well for small strings and on workloads that don't need to scale, it quickly breaks down when any amount of load is introduced (something that, unfortunately, most writers don't bother considering). The Less Than Dot writer, "onpnt," did do some testing, and discovered that--surprise, surprise--String.Split isn't all that great.

The issue? Well, it all comes down to large memory allocations and the art of scalable .NET programming--an area about which many SQL Server developers can (and do) remain blissfully naïve. In .NET, reduction of memory utilization--especially large allocations done en masse--is king, and String.Split does exactly the wrong thing. It takes the input string, breaks it into N chunks, and allocates all of the memory needed to store those chunks and pointers to those chunks, in one big operation. This can't possibly scale, and indeed it doesn't. I did a quick SQLQueryStress test of a TVF based on String.Split and saw fairly good performance when the input sentences were small (in the 40-400 byte range--see below), but after a certain point the AppDomains began recycling and performance became abysmal. Protections put in place for stability of the CLR host include memory leak detection, and this kicks in quite readily when we force allocation of so much memory in one shot--a good thing for the SQL Server instance as a whole, but not so great when we're trying to split a huge string.
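
For context, here is the sort of String.Split-based TVF in question: a representative sketch with a hypothetical name (naive_split), not the exact code I tested, and reusing the FillIt row-filler shown later in this post. Note that the entire result array is allocated before the first row is returned.

    using System;
    using System.Collections;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        // Representative sketch only: String.Split allocates the whole result
        // array, and every substring in it, in one shot.
        [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "FillIt",
            TableDefinition = "output nvarchar(4000)")]
        public static IEnumerable naive_split(SqlChars instr, SqlString delimiter)
        {
            if (instr.IsNull || delimiter.IsNull)
                return new string[0];

            // One big allocation: N chunks plus the array of references to them.
            return instr.ToSqlString().Value.Split(Convert.ToChar(delimiter.Value));
        }
    }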

Students of .NET who are concerned with scale (and really, everyone should be) are urged to look at the way problems are handled in LINQ. There, the vast majority of requests are handled internally using streaming iterator patterns, rather than by moving around huge chunks of memory. This turns out to be a much more scalable option for several reasons: lower in-flight memory utilization, fewer large object heap allocations, and intermediate data that the garbage collector can reclaim as soon as it is no longer needed.
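
As a minimal illustration of the pattern (a sketch, not code from LINQ itself), compare an eager method that materializes all of its results up front with a C# iterator, which the compiler turns into a state machine that produces one value per request:

    using System.Collections.Generic;

    static class StreamingSketch
    {
        // Eager: all n results (and the array holding them) are allocated
        // before the caller sees the first element.
        public static int[] SquaresEager(int n)
        {
            int[] result = new int[n];
            for (int i = 0; i < n; i++)
                result[i] = i * i;
            return result;
        }

        // Streaming: each value is computed only when the caller asks for it,
        // so in-flight memory stays constant regardless of n.
        public static IEnumerable<int> SquaresStreaming(int n)
        {
            for (int i = 0; i < n; i++)
                yield return i * i;
        }
    }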

So how can we apply streaming to the string splitting problem? Rather than break the string up into all of its component parts upfront, we can walk the string step-by-step, only finding the next piece as required. In order to handle this, I created the following worker class:

    using System.Collections;

    public class splitIt : IEnumerator
    {
        public splitIt(string theString, char delimiter)
        {
            this.theString = theString;
            this.delimiter = delimiter;
            this.lastPos = -1;
            this.nextPos = -1;
        }

        #region IEnumerator Members

        // Returns the chunk between the previous delimiter and the next one,
        // trimmed of leading and trailing whitespace.
        public object Current
        {
            get { return theString.Substring(lastPos, nextPos - lastPos).Trim(); }
        }

        // Advances to the next chunk, scanning only as far as the next
        // delimiter; no more than one chunk is materialized per call.
        public bool MoveNext()
        {
            if (nextPos >= theString.Length)
                return false;
            else
            {
                lastPos = nextPos + 1;

                if (lastPos == theString.Length)
                    return false;

                nextPos = theString.IndexOf(delimiter, lastPos);
                if (nextPos == -1)
                    nextPos = theString.Length;

                return true;
            }
        }

        public void Reset()
        {
            this.lastPos = -1;
            this.nextPos = -1;
        }

        #endregion

        private int lastPos;
        private int nextPos;
        private string theString;
        private char delimiter;
    }


This class is a simple enumerator implementation that looks for the next delimiter on each iteration, only when requested. Splitting strings in this way, rather than using String.Split, means that no huge upfront allocation takes place. Aside from the sentence itself, only one "chunk" is in play at any given time, and any chunks that have already been used can be garbage collected as needed when memory is tight.
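
As a quick sanity check outside of SQL Server, a hypothetical console harness (not part of the original post) can drive the enumerator directly:

    // Hypothetical harness: consume splitIt one MoveNext/Current pair per chunk.
    using System;
    using System.Collections;

    class SplitItDemo
    {
        static void Main()
        {
            IEnumerator e = new splitIt("foo, bar, baz", ',');
            while (e.MoveNext())
                Console.WriteLine((string)e.Current);
            // Output: foo, bar, baz -- one per line, each trimmed.
        }
    }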

Wiring this class up in a SQLCLR TVF is just about as simple as when using String.Split:

    [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "FillIt", TableDefinition = "output nvarchar(4000)")]
    public static IEnumerator faster_split
        (
            SqlChars instr,
            [SqlFacet(IsFixedLength=true, MaxSize=1)]
            SqlString delimiter
        )
    {
        // A NULL input or NULL delimiter yields an empty result set rather
        // than an error.
        return (
            (instr.IsNull || delimiter.IsNull) ?
            new splitIt("", ',') :
            new splitIt(instr.ToSqlString().Value, Convert.ToChar(delimiter.Value)));
    }

    // Called once per row by the SQLCLR framework to convert each chunk into
    // the output column.
    public static void FillIt(object obj, out SqlString output)
    {
        output = (new SqlString((string)obj));
    }

I've enhanced this example slightly compared with most of the usual suspects: A SqlFacet attribute is used to make sure that the delimiter is only a single character, and I've added a bit of additional code in the main method to deal with NULL inputs.

The scalability difference between this method and the String.Split method is staggering in the simple tests I ran today on my SQL Server 2008 test server.  With small sentences, even under moderate load (100 concurrent threads), each method performs more or less equivalently.  But as sentence size increases to 50KB, the String.Split method begins slowing, taking almost 2 seconds per iteration, and the occasional AppDomain recycle is seen in the SQL Server log.  The streaming method, on the other hand, continues to complete its job in just over 1/10th of a second, and causes no AppDomain recycles. Increasing to 500KB sentences, String.Split causes numerous AppDomain recycles and time per iteration increases to over 30 seconds, while the streaming method averages just 16 seconds per iteration. Jumping to 5MB sentences, String.Split causes almost continuous AppDomain recycles, and each iteration takes almost 6 minutes to complete. Yet with the streaming method, even with sentences of this size I am still unable to cause an AppDomain recycle to occur, and iterations complete in around 55 seconds.

The test I did was quite simple, and I won't post too many details here, as I prefer that you test with your own workloads and draw your own conclusions about how this method fares when compared with T-SQL versions or the naïve String.Split method. I hope that if you do test you'll post back here with your results so that we can all learn the best way to handle these problems--whether or not string splitting really is all that interesting in the post-SQL Server 2005 world.

Published Sunday, April 26, 2009 8:59 PM by Adam Machanic


Comments

 

Ted Krueger said:

Very impressive, Adam, as with anything I've read of yours. I'm glad the one on LTD got you to write this, and that I had the chance to get feedback on the imperfections in mine, along with the more efficient ways of performing the task. It's easy to say things can be better, but showing it like you have here is the best way we can learn from it. I know I have! -Ted

April 26, 2009 9:13 PM
 

Paul White said:

Hey Adam,

This debate is currently running over at SQLServerCentral:

http://www.sqlservercentral.com/Forums/Topic695508-338-1.aspx?Update=1

We used a very similar TVF to yours, and found that leaving SqlChars.Value as char[] improved the speed of the split slightly.

That is the current fastest CLR TVF for most cases, though for larger strings with fewer delimiters, a RegEx split is even faster.

A nice graph showing the two fastest CLR TVFs versus two of the best T-SQL approaches can be found here:

http://www.sqlservercentral.com/Forums/FindPost704614.aspx

The other interesting thing to come out so far is that the CLR TVFs are very significantly faster than tally (numbers) table based T-SQL routines, long held to be the best way to split strings.

There are a large number of tests in that thread, from small strings to splitting the entire text of Moby Dick!

Finally, it is a shame that while the TVF can stream its output via IEnumerator and IEnumerable, streaming input can't be achieved at the same time, either via SqlChars or by using a SqlDataReader to read the rows-to-split from a persistent table.

Cheers,

Paul

April 27, 2009 8:11 AM
 

Adam Machanic said:

Hi Paul,

Thanks for the pointer.  I just took a quick look through the thread but couldn't find any methods similar to mine.  I saw the "chars" method, but it has the same problem as String.Split: It does all of the splitting work upfront, producing an intermediate collection (it uses an ArrayList -- if I've looked at the wrong solution please point me to the correct one).  I didn't see anything in there about anyone actually load testing these methods, and that's where the scalability factor comes into play.  The upfront allocations simply will not scale.

By the way, SqlChars does stream the data in, and if I can figure out a way to leverage the char collection without doing an upfront large allocation I'll post back an updated solution that should be even better.

April 27, 2009 8:50 AM
 

Paul White said:

Thanks for the quick reply - and, yes, you are quite right - all the solutions there do the allocation up-front, not using IEnumerator methods explicitly.  I do apologize for my confusion there.  Sorry.

The reason for my confusion was that I had experimented with a solution using explicit IEnumerator methods, when trying to write a CLR TVF to read all the strings to split directly from the source table.  The rows were read into a SqlDataReader, and the plan was to stream the input from the Reader and also stream the output in the IEnumerator methods.

(BTW yes SqlChars can stream, but this method was trying to read all rows at once, rather than streaming individual rows...)

Sadly, it failed for two reasons:

(a) CLR TVFs can only do data access in the InitMethod, not in the method called via FillRowMethodName, so I would have had to read all the data up front again.

(b) Sending the results down the SqlPipe row-by-row using a SqlDataRecord and associated metadata is really slow.  Building a DataTable in a DataSet and sending a SqlDataReader from the DataSet is even worse.

My apologies again.

Paul

April 27, 2009 9:27 AM
 

Adam Machanic said:

Hi Paul,

No need to apologize for anything :-)

I did some work with the SqlPipe for a running sums problem and found it to be plenty fast for my needs.  How big was the data you were returning?

April 27, 2009 11:01 AM
 

Paul White said:

It was returning the split string results from Jeff Moden's test table.  This generates 800K rows of single small strings O.0

It wasn't that slow in absolute terms, it was just a little disappointing that it was slower than the CROSS APPLY with the original CLR TVF.  It was about the same speed as the simple tally table solution, which was one of the worst performers overall.

Of course, it may be simply that my C# skills are not all they could be!

The table-at-once approach was never intended to be a good addition to the SQL CLR developer's toolkit, for anyone wondering - it was an attempt to find the fastest possible solution for the original posted problem on SSC.

Its memory allocation in particular would be horrible - I just wanted to find a way to cut down the round trips between the database engine and CLR, and to reduce the number of calls (late-bound by reflection?) to the IEnumerator methods.

I was apologising for replying too quickly without taking the time to fully digest your solution.  I stand by that!  I will learn :c)

Paul

April 27, 2009 4:52 PM
 

Adam Machanic said:

Two days ago, after posting what I thought was a pretty solid SQLCLR string splitting method, I received ...

April 28, 2009 3:08 PM
 

M said:

Have you investigated Regex.Split, using a precompiled regex?

April 30, 2009 3:07 AM
 

Paul White said:

M,

Yes - see the link to SSC in Adam's article.  The Regex is modestly faster for very large strings - but see Adam's next post for the ultimate solution.  As you will see, the problem with all the CLR methods up to this point (including a pre-compiled Regex) is that memory is allocated for the entire split up front...

Cheers,

Paul

April 30, 2009 4:25 AM
 

Adam Machanic said:

M: As Paul mentioned, Regex.Split is out, due to upfront memory allocation.  Precompiled regex might still be an option, but I can't find a way to stream the results (i.e., an IndexOf method or something similar).  Do you have any ideas?

April 30, 2009 10:03 AM
 

Simon Sabin said:

What if you stream the positions back and then use SUBSTRING in the T-SQL? This avoids the need to make any copies of the strings.

May 4, 2009 4:58 PM
 

Paul White said:

Simon,

Wouldn't SQL Server be making a copy of the string when it performs the substring? ;c)

Paul

May 4, 2009 5:53 PM
 

Adam Machanic said:

I tried something similar in the past without much success, but it was 100% T-SQL.  I'm not sure if the performance issue was with SUBSTRING or CHARINDEX but on VARCHAR(MAX) it was pretty bad.  I'll have to test and see if that makes a difference.  Only issue is that it would require two separate modules to do a split, which becomes a bit more of a maintenance nightmare.

May 5, 2009 1:11 PM
 

TechVsLife said:

Thank you, this is very useful (though it should be included as a built-in T-SQL function).  I haven't read through all the many pages spilled on this subject, but I assume the conclusion is that your CLR version is among the fastest, and faster in most scenarios than the T-SQL version?

btw, I added a comment about an update to your pure T-SQL version, at:

http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/splitting-a-string-of-unlimited-length.aspx

July 19, 2009 12:28 PM
 

Dave said:

Hi Adam

Would you mind please posting the whole code for your example?  I am trying to learn how to create a CLR TVF to split strings, and this page comes highly recommended.  However, I can't figure out how to fill in the gaps... Thanks :-)

February 2, 2012 7:21 AM
 

Adam Machanic said:

Hi Dave,

First of all, you should be using this code:

http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx

To use: Fire up Visual Studio, launch a new C# Database Project, right-click on the project and click "Add," select any item, and then go to the editor window and paste the code in on top of the template. Build it, deploy it, and you're done :-)

Feel free to shoot back with any remaining questions!

February 3, 2012 11:29 PM
 

Jesús López said:

The function would be simpler using yield return:

    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction(
            TableDefinition="token nvarchar(4000)",
            FillRowMethodName="GetNextString")]
        public static IEnumerable Split(
            [SqlFacet(MaxSize=-1)] SqlString str,
            [SqlFacet(IsFixedLength=true, MaxSize=1)] SqlString separator)
        {
            if (str.IsNull || separator.IsNull) yield break;

            char splitter = separator.Value[0];
            int currentPosition = 0;
            int nextPosition = 0;
            string inputStr = str.Value;
            int strLength = inputStr.Length;

            do
            {
                nextPosition = inputStr.IndexOf(splitter, currentPosition);
                if (nextPosition == -1) nextPosition = strLength;
                yield return inputStr.Substring(currentPosition, nextPosition - currentPosition).Trim();
                currentPosition = nextPosition + 1;
            } while (currentPosition < strLength);
        }

        public static void GetNextString(Object obj, out SqlString str)
        {
            str = new SqlString((string)obj);
        }
    };

January 24, 2013 5:41 AM
 

JG Coding said:

Give my CLR JSON Parser a try. I believe it is the only one which deploys with SAFE permissions. It uses regex with recursion to blow through the string. https://github.com/jgcoding/J-SQL.git. I am working on alternate params and functions to control the depth of the parse.

Still working on dealing with pre-formatted (prettified) strings, those with line-feeds, carriage returns, and tabs not residing within a valid string value. Special characters residing within a string value pose no issue.

June 12, 2013 3:18 PM
 

Adam Machanic said:

Thanks for sharing, JG. I've not yet had any reason to work with JSON from SQL but I'm sure someone will find it useful.

June 12, 2013 4:11 PM
 

