THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Eric Johnson

My Thoughts on SQL Server and Database Related Technologies.

The Trouble with SSIS Sorting

Many SSIS transformations, such as the Pivot and the Data Profiling Tasks, require that the data being fed into them be sorted. Without first sorting, some transformations will throw an error and not run, while others will run but the results will not be as expected. The answer is to sort the data before it is fed into the transformation; the problem is that this isn’t as simple as you may think. The obvious and simple way to accomplish this is to use the Sort Transformation in SSIS. The Sort Transformation will complete the objective of sorting the data, but the performance might not be exactly what you would expect. The Sort Transformation does all its work in memory; SSIS packages do things in individual sets. This means your data may need to be sorted and resorted again. If you try to work with large data sets, this sorting will take a long time and use a lot of memory. The solution is to sort the data from the source using an ORDER BY clause. That alone, however, won’t get it done. You also have to tell SSIS that the data is sorted and how it is sorted. This is done in the Advanced Editor of your data source. First, set the IsSorted property of the source’s output to true. Next, set the SortKeyPosition property of each column you specified in your ORDER BY clause; 1 for first, 2 for second, and so on. A SortKeyPosition value of 0 is the default and means that the column is not used in the sort. That’s all there is to it, using this method you will be able to sort your data so the SSIS transformations are happy, and can work without hours of waiting and memory hogging.

Published Wednesday, February 03, 2010 8:00 AM by ejohnson2010

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Andrew Ingram said:

One point to note on sorting using an order by in the original select is that if you need to sort later in the transform and join the results, the way the source database sorts must be the same as the way SSIS sorts - letters before numbers and vice versa being the prime example.

February 3, 2010 9:15 AM
 

Josh Grant said:

Expanding on Andrew's point, I've found that it is UNSAFE to rely entirely upon database sorting, and I am now avoiding one particular data flow component which requires sorted inputs.

Here's an example you can build yourself...

- Locate (or create) a SQL Server database with the SQL_Latin1_General_CP1_CI_AS (case insensitive) collation

- Create a data flow with two OLEDB sources, pointing to these two queries.  Note that query #2 is identical to query #1, with one extra row.

-- Query #1

SELECT  'ABb' AS Col1,

       1 AS Query1RowNumber

UNION ALL

SELECT  'AbB',

       2

ORDER BY Col1

-- Query #2

SELECT  'ABb' AS Col1,

       1 AS Query2RowNumber

UNION ALL

SELECT  'AbB',

       2

UNION ALL

SELECT  'Ab',

       3

ORDER BY Col1

- Using the advanced editor, set the "IsSorted" property on each data source, and set the "SortOrder" to 1 for the Col1 column

- Add a merge join component, join type "full outer join".  Make "Col1" the join key, and allow Query1RowNumber and Query2RowNumber to pass through.

- Add a destination of some kind, and put a data viewer on the output of the merge join

The expected result is this, right?

Query1RowNumber Query2RowNumber

--------------- ---------------

1               1

2               2

NULL            3

What you actually get is:

Query1RowNumber Query2RowNumber

--------------- ---------------

NULL            3

NULL            2

1               1

2               NULL

Row number 2 (Col1 = 'AbB') isn't matched in the full outer join, and it is returned as two "unique" rows!

My conclusion from this is that SSIS components which require sorted inputs actually require "SSIS-sorted" inputs, or they can return unexpected results!

May 27, 2010 10:01 AM
 

Mike said:

Microsoft is promoting the idea that doing everything the memory is great. I does definitely speed up the process plus it help large companies to sell hardware and memory. Sorting algorithms are very well documented and it is difficult to invent something new.

When working with large datasets there are ways of sorting data without buying new hardware

More information

http://www.etl-tools.com/etl-tools/processing-data/sorting.html

October 10, 2010 11:39 AM
 

justins707 said:

I think the diff is that you are using a UNION ALL.  If you change the Source SQL to use UNION (without ALL).  Then the results look correct,

null   3

1      1

2      2

Sort Order is null, Ab, ABb, AbB.

October 27, 2010 7:22 PM
 

Chris Nyberg said:

With Ordinal Technology's Nsort for SSIS sort component, you can quickly sort data sets even if they do not fit in memory. The Nsort component uses a temporary file to avoid the page faulting coma the standard SSIS component can fall into.  See http://www.ordinal.com/ssis.html

February 17, 2011 12:49 AM
 

Jemsy Villanera said:

Good Post, thanks It help me a lot :)

March 15, 2012 5:11 PM
 

Emil Glownia said:

Interesting post and comments.... I was looking to do some research before I create some more videos for our SSIS Transformations Tutorial and got more ideas than I expected.

I got also interested in why removing ALL would make a difference... and that is only because Case Insensitive collation was used so it is "random" sorting or more precisely UNION and UNION ALL uses different sort for "the same" values.

with Case Insensitive this query:

SELECT  'ABb'AS Col1,       1 AS Query2RowNumber

UNION ALL

SELECT  'AbB',

      2

UNION all

SELECT  'Ab',

      3

ORDER BY Col1

Returns 3,2,1 but with DESC at the end

SELECT  'ABb'AS Col1,       1 AS Query2RowNumber

UNION ALL

SELECT  'AbB',

      2

UNION all

SELECT  'Ab',

      3

ORDER BY Col1 desc

Result is not reverse 1,2,3 but 2, 1, 3... if I take out ALL than result is 1, 2, 3 because 1 and 2 is the same value (with case insensitive collation), obviously Case Sensitive Collation should be used instead but this is a good starting point for me to make some SSIS test ;)

Regards

Emil

November 19, 2012 6:14 AM

Leave a Comment

(required) 
(required) 
Submit

About ejohnson2010

Eric (Microsoft SQL MVP) is the co-founder of Consortio Services, and the primary Database Technologies Consultant. His background in Information Technology is diverse, ranging from operating systems and hardware to specialized applications and development. He has even done his fair share of work on networks. Since IT is really just a way to support business processes, he has also acquired his MBA. All in all, he has 10 years of experience with IT, a great amount of which has been working with Microsoft SQL Server. Eric has managed and designed databases of all shapes and sizes. He has delivered numerous SQL Server training classes and webcasts as well as presentations at national technology conferences. Most recently, he presented at TechMentor on SQL Server 2005 Replication, Reporting Services, and Integration Services. In addition, he is active in the local SQL Server Community, serving as the President of the Colorado Springs SQL Server Users Group.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement