
Stacia Misner

24HOP: BI Workload Follow-Up

Yesterday, Denny Cherry (blog|twitter) and I co-presented a 24HOP session for the Fall 2011 lineup, “So How Does the BI Workload Impact the Database Engine?” 24HOP stands for 24 Hours of PASS and is a semiannual roundup of speakers from the SQL Server community. Initially, this event consisted of 24 consecutive sessions, each lasting an hour, but it has since become a two-day event with 12 consecutive sessions each day. The sessions are free to attend and feature many great topics covering the full spectrum of SQL Server. Even if you missed previous 24HOP events, you can always go back and view recordings of sessions that interest you at the 24HOP sites for Spring 2011 and Fall 2010.

And if you missed Denny and me yesterday, a recording will be available in a couple of weeks and I’ll update this post with a link. Our hour-long session for 24HOP was a sneak preview of our upcoming half-day session of the same name that we’ll be presenting at the PASS Summit in Seattle on Thursday, October 13, 2011 from 1:30 PM to 4:30 PM. In our half-day session, we’ll dig into the details and spend more time on database engine analysis, whereas in our 24HOP session, we focused on reviewing the architecture and highlighting the connection between BI components and the database engine.

We were able to answer a few questions at the end, but one question in particular could not be answered easily in a sentence or two in the time allotted: How much RAM do I need to plan for Integration Services (SSIS)? Andy Leonard (blog|twitter) did manage a succinct response: All of it! I, on the other hand, am not known for being succinct, so I deferred the question to this post.

Andy is right that SSIS wants as much memory as you can give it, which can be problematic if you’re executing an SSIS package on the same box as SQL Server. On the other hand, there are also benefits to executing the package on that box, so there is no one-size-fits-all solution, and the solution for one data integration scenario might not be the right solution for another. A lot depends on what CPU and RAM resources a given server has and how much data is involved. To know how much horsepower you need, you’re going to have to do some benchmark testing with your packages. Here are some good resources for SSIS if you’re concerned about memory:

Is there a rule of thumb for deciding how much memory you’ll need for SSIS? Well, no less than 4 GB per CPU core is a good place to start. If that’s not possible, you certainly want memory that’s at least two or three times the size of the data you expect to be processing at a given time. So if you’re processing 1 GB of data, you’ll want at least 2-3 GB of memory and, of course, more memory is even better!
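If it helps to see that arithmetic in one place, here is a minimal T-SQL sketch of the rule of thumb. The core count and data volume below are made-up placeholders for illustration, not recommendations for any particular server:

-- Back-of-the-envelope SSIS memory sizing; replace the placeholder values
-- with your own server's core count and expected data volume.
DECLARE @CpuCores int = 8;            -- cores on the box running SSIS (placeholder)
DECLARE @DataGB decimal(10,2) = 1.0;  -- GB of data processed at a given time (placeholder)
SELECT @CpuCores * 4 AS RuleOfThumbGB,  -- 4 GB per core starting point
       @DataGB * 2   AS MinimumGB,      -- at least 2x the data size
       @DataGB * 3   AS BetterGB;       -- 3x is more comfortable

Either way, treat the output as a starting point for your benchmark testing, not a final answer.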

Published Thursday, September 08, 2011 10:16 AM by smisner


Comments

 

Rafi said:

Hi Stacia,

Thanks for the session yesterday; I'm looking forward to meeting you at the summit. I'm the one who asked that specific question in the 24HOP session yesterday, and I'm glad you took the time to answer it in depth.

We have a rather huge DW whose data derives from 150+ files, some with over 100 million rows in them. These files are loaded into staging tables using BCP, and that is a constant. The thing is, at the moment we are rebuilding this huge DW from scratch each week (I'm working on changing this approach, but I'm new at the company, so for the meanwhile this is how it's being done).

The server has 128 GB of RAM, so if I understand you correctly, I conclude that giving SQL Server most of the RAM will not serve me well in my current scenario if we choose to switch to using SSIS for the ETL.

I will need to divide the allocation so that SQL Server has enough RAM to deal only with the reads and writes that the SSIS packages will create, and give the rest to SSIS for the transformation workload. Because SQL Server reads the data from the staging tables only once, having a big buffer pool to keep these pages is not giving me any performance benefit. Is that correct?

Another thought: would it be wrong to conclude that, when working with these limitations (that is, the data is already loaded into local staging tables), SSIS may not be the right tool to use? What I’m thinking is, wouldn’t it be a good idea to let SQL Server do the transformations directly in the buffer pool? The SSIS data flow reader will put some of the data in the SQL buffer pool anyway while reading, so why hold the same pages in RAM twice, once for SQL Server to deliver to SSIS and once for the SSIS internal cache?

Cheers

Rafi

September 9, 2011 2:17 AM
 

smisner said:

Hi Rafi,

I would ask what problem you're trying to solve. Faster loads? Less impact on the network? If I were pressed to make the case on behalf of SSIS, I would say the following:

- Automation of the load process

- Handling transformations in memory – which could be faster than executing T-SQL operations in some situations

- Configurability across multiple environments (i.e., changing connection strings, etc.)

- Dynamic run-time changes through the use of expressions to control behavior

- Customized error handling

- Ability to execute multiple data flows in parallel

- Debugging capabilities

Here's a comment from the SQL Server Performance Team (http://blogs.msdn.com/b/sqlperf/archive/2008/02/27/etl-world-record.aspx)

Why not just do a bulk load of the data?

It is rare in businesses today that data is always available on the destination system, and does not need to be standardized or corrected for errors before loading. These rare cases are the times that bulk loading data makes sense. Data integration can involve complex transformation rules, error checking and data standardization techniques. ETL tools like SSIS can perform these functions such as moving data between systems, reformatting data, integrity checking, key lookups, tracking lineage, and more. SSIS has proven itself to be a versatile ETL tool, and now it is shown to be the fastest one as well.

But perhaps the best explanation of when to use SQL versus SSIS comes from Jamie Thomson at http://consultingblogs.emc.com/jamiethomson/archive/2006/03/14/SSIS_3A00_-Data-flows-or-T_2D00_SQL.aspx. Be sure to read the "spirited" discussion in the comments below the post!

September 9, 2011 4:27 PM
 

smisner said:

And Denny had these comments with regard to the database engine aspects of your question:

Without SSIS running on the server, I’d probably max SQL Server out at 120 GB of RAM. With SSIS running on the server, you’ll want to run the package and see how much memory SSIS needs, then reconfigure SQL Server accordingly to leave that memory for SSIS.
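For reference, that cap is set with sp_configure. A quick sketch of the 120 GB setting (expressed in MB) looks like this; lower the number by whatever you measure SSIS actually using:

-- Cap SQL Server's memory so the rest of the box is left for SSIS.
-- 120 GB = 122880 MB; adjust downward once you've measured SSIS's needs.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 122880;
RECONFIGURE;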

As to your buffer pool question, just because SQL Server isn’t reusing the same data pages doesn’t mean the buffer pool isn’t doing anything for you performance-wise. SQL Server will preload data pages into the buffer pool when queries are run so that it makes fewer trips down to the physical disk, as part of the read-ahead process. The way this works is that as SQL Server’s storage engine figures out what data is being asked for, more of the table is read from disk ahead of when the query engine actually needs it, so the data is already in the buffer pool by the time the query engine gets around to reading it.
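If you want to see how much of the staging data is actually sitting in the buffer pool while a load runs, a query along these lines against sys.dm_os_buffer_descriptors reports buffered pages per database (note that scanning this DMV can itself be slow on a large-memory server, so use it sparingly):

-- Buffered pages per database, converted to MB (pages are 8 KB each).
SELECT DB_NAME(database_id) AS DatabaseName,
       COUNT(*) * 8 / 1024 AS BufferPoolMB
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY BufferPoolMB DESC;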

September 9, 2011 5:06 PM
 

Rafi said:

I totally agree that SSIS has many advantages for loading data from files into SQL Server, but as I mentioned, the external supplier of the data is contractually responsible for delivering the data into our SQL Server staging area. That is not going to change any time soon, I was told.

That puts me at a point where I need to find out whether there is any point in doing the transformations and lookups in memory using SSIS, or whether I should let SSIS handle the parallel execution, logging, etc., and use T-SQL stored procedures for the data flow between tables that are all on the same SQL Server instance. The source and destination databases will always be on the same physical server and the same SQL Server instance.

September 12, 2011 4:14 AM
 

smisner said:

The only way to find out is to set up a baseline test using each approach and measure the results.
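A bare-bones way to time the T-SQL side of such a test might look like the sketch below, where dbo.LoadFactTable is a hypothetical stand-in for whichever stored procedure you're testing; compare the elapsed time against the duration your SSIS package logs for the equivalent data flow:

-- Time one stored-procedure-based transformation.
DECLARE @start datetime2 = SYSDATETIME();
EXEC dbo.LoadFactTable;  -- hypothetical procedure under test
SELECT DATEDIFF(second, @start, SYSDATETIME()) AS ElapsedSeconds;

Run each approach several times against the same data and compare averages rather than trusting a single run.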

September 14, 2011 12:55 AM
