
Joe Chang

Oracle Index Skip Scan

Oracle has had a feature called index skip scan since version 9i. When I first came across it, it seemed like a very clever trick, but not a critical capability. More recently, I have been advocating DW (data warehouse) on SSD in appropriate situations, and I now think this is a valuable feature for keeping the number of nonclustered indexes to a minimum.

Briefly, suppose we have an index with key columns: Col1, Col2, in that order. Obviously, a query with a search argument (SARG) on Col1 can use this index, assuming the data distribution is favorable. However, a query with the SARG on Col2 but not Col1 cannot use the index in a seek operation.

Now suppose that the cardinality of Col1 (the number of distinct values of Col1) is relatively low. The database engine could then perform a seek for each distinct value of Col1, combined with the specified SARG on Col2. Microsoft SQL Server currently does not have the Oracle Index Skip-Scan feature, but the capability can be achieved with a work-around.
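As a concrete illustration (the table and column names below are hypothetical), the index and the two query shapes look like this:

-- Hypothetical two-column index: Col1 leads, Col2 follows
CREATE INDEX IX_T_Col1_Col2 ON dbo.T (Col1, Col2);

-- Can use an index seek: the SARG is on the leading key column
SELECT Col1, Col2 FROM dbo.T WHERE Col1 = 17;

-- Cannot seek on this index: the SARG is on Col2 only, so SQL Server scans,
-- whereas Oracle can use an index skip scan when Col1 has few distinct values
SELECT Col1, Col2 FROM dbo.T WHERE Col2 = 42;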

In this example, the LINEITEM table has a cluster key on columns L_SHIPDATE, L_ORDERKEY, but does not have an index leading with L_ORDERKEY. Our query is to find a specific Order Key in the LineItem table. If there is a table with the distinct date values, DimDate, we could force a loop join from the DimDate table to LineItem (even though only columns from LineItem are required) to get the execution plan below.

[SkipScan execution plan: loop join from DimDate driving clustered index seeks on LINEITEM]
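A sketch of the query that produces this kind of plan follows. The DimDate column name (DateValue), the specific order key value, and the hints are illustrative assumptions; the point is simply to force a nested loops join with DimDate as the outer source.

-- Force a loop join from DimDate so each distinct date drives a seek
-- on the LINEITEM clustered index (L_SHIPDATE, L_ORDERKEY)
SELECT li.*
FROM dbo.DimDate d
INNER LOOP JOIN dbo.LINEITEM li
  ON li.L_SHIPDATE = d.DateValue
WHERE li.L_ORDERKEY = 123456
OPTION (FORCE ORDER);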

The question now is: how effective is this technique? The most efficient execution plan is, of course, one using an index that leads with the Order Key column. But the premise here is keeping the number of nonclustered indexes to an absolute minimum. So how does the above execution plan compare with a table scan?

A table scan for this type of query, in which only a few rows meet an easy-to-evaluate SARG, might run at about 1GB/s per core. Note this is far higher than the 200MB/s cited in the Microsoft Fast Track Data Warehouse documents. That is because the FTDW baseline is a table scan that aggregates several columns of every row, and on top of that requires a Hash Match to group the results. Basically, a needle-in-a-haystack table scan runs much faster than the more complex aggregate-and-group scan. At most, a 1GB table scan might be acceptable for a non-parallel execution plan, and even a 50GB table scan could be tolerable on a powerful 32-core system with an unrestricted parallel execution plan.

A loop join can run somewhere in the range of 100,000-200,000 seeks/sec to the inner source. Realistically, a data warehouse with 10 years of data has 3,652 distinct days (depending on how the leap years fall). A loop join with about 3,650 rows from the outer source should therefore run somewhere around 36ms. Even if the DW had 20 years of data, this is still acceptable, on the assumption that searches on the non-leading Order Key column are in the minority, the benefit being one less index required on the big table. If the query can be bounded to a single year or quarter, we approach the efficiency of having the extra nonclustered index.
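Bounding the date range, as suggested above, is just an extra predicate on the DimDate side; the year below is an illustrative assumption. At roughly 100,000 seeks/sec, about 365 outer rows cost a few milliseconds instead of the roughly 36ms for the full 10-year range.

-- Same work-around, bounded to a single year: ~365 outer rows instead of ~3,650
SELECT li.*
FROM dbo.DimDate d
INNER LOOP JOIN dbo.LINEITEM li
  ON li.L_SHIPDATE = d.DateValue
WHERE li.L_ORDERKEY = 123456
  AND d.DateValue >= '2010-01-01' AND d.DateValue < '2011-01-01'
OPTION (FORCE ORDER);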

Published Monday, June 13, 2011 12:55 AM by jchang


Comments

 

robnoi said:

Regarding the discussion about the "skip scan", there is now a Microsoft Connect item: https://connect.microsoft.com/SQLServer/feedback/details/695044/implement-index-skip-scan

Please vote for it.

October 15, 2011 10:32 AM
 

jchang said:

Well, the connect request could have included a link to my blog. I am sure Conor loves my suggestions and will promptly put all my requests at the top of his (recycle) bin.

October 18, 2011 11:39 PM
 

tobi said:

I just added a link to this post from the connect item (I opened the connect item originally). Please vote on it.

June 7, 2012 4:46 PM
 

Daniel Adeniji said:

Thanks.  Your insight into Storage Performance is groundbreaking & breathtaking.

I like your frustration about certain things, and how you take a minimal, engineering approach to isolating and solving problems.

And then you turn around and ask each of us: have you thought about this problem as well? And how did you solve it -- by adding another index?

March 23, 2013 12:16 PM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
