Margarity Kerns recently published a very nice article at SQL Server Central on using hash functions to detect changes in rows during the data warehouse load ETL process. On the discussion page for the article I noticed a lot of the same old arguments against using hash functions to detect change. After having this same discussion several times over the past several months in public and private forums, I've decided to see if we can't put this argument to rest for a while. To that end I'm going to hold a little contest: **Generate an SHA-1 hash collision and win $100 and a book** (see bottom section for details). Before I get into the details of the contest I'm going to give a little background of how this came about.

__Background Info__

**NOTE: If you aren't familiar with hash functions I highly recommend first reading the Wikipedia article at http://en.wikipedia.org/wiki/Cryptographic_hash_function.**

The idea of using a hash function for change detection is not new. Essentially a hash function generates a "fingerprint" of your data that you can use to compare an inbound row and an existing row.

Some people are wary of hash functions because they map a theoretically infinite number of large inputs to a much smaller finite set of hash values. Most of the arguments people make against using hash functions for change detection boil down to variations of Murphy's Law:

**"There's a chance of a ***hash* *collision* [generating the same hash value for two different inputs], so a collision *will* happen!"

People have different ways of dealing with this issue, including taking one of the following positions:

- The chance of collision is negligible so no additional precautions are required.
- A collision will absolutely happen so I won't use hash functions for change detection at all!
- A collision may happen so I want to use hash values only to initially narrow down the number of rows I need to compare fully.

Positions #1 and #2 above are at different ends of the spectrum. Position #3 sits in the middle as a compromise solution. While compromises may make for good politics, they often make for terrible technical solutions, as I'll discuss below.

__Position #1: Odds of Collision are Low Enough to be Ignored__

As far as position #1 is concerned, it depends on which hash function you're using. You need to choose a true one-way *collision-free** cryptographic hash function with a wide bit length. I normally recommend an SHA-2 hash function (256, 384 or 512 bit hash value), or when that's not available the SHA-1 160 bit hash function. The odds of generating a collision with a 160 bit hash function are 2^80. That is to say you can expect a collision after you generate hashes for 1,208,925,819,614,629,174,706,176 rows of data.

Of course if you're identifying rows by their natural or business keys this alternatively means you need to generate 1,208,925,819,614,629,174,706,176 variations of that single row before you'll hit a collision with SHA-1.

To put that number in perspective, consider that Google processes 20,000,000,000,000,000 bytes (20 petabytes) of data per day. If you were to store a single row in a database table for *every single byte* Google processes each day, it would take you 60,446,290 days (approximately 156,600 years) to store 1,208,925,819,614,629,174,706,176 rows in that table.

I personally assume position #1 on this subject, with the assumption that you have chosen a good solid hash function for the job. More on this later.

*A *collision-free* cryptographic hash function is a one-way hash function with negligible probability of generating the same hash value for two different inputs. SHA-1 and SHA-256 are examples of collision-free cryptographic hash functions.

__Position #2: I Don't Trust Hash Functions__

This position can't really be argued with. As shown above the odds of a collision with SHA-1 or another collision-free hash function are extremely low. But if you don't trust it, you just don't trust it. So the alternative is to compare every inbound column with every existing column. It will cost you in efficiency on wide tables, but if you're not concerned about processing power, server resources and execution time this classic method of change detection is well-proven to be 100% effective.

__Position #3: The Compromise - Use Hash Values to Initially Narrow Down Results__

This position is the compromise position that combines the implementation of #1 and #2 above. It sounds wonderful in theory - use a hash function to narrow down your results, eliminating rows that don't need to be compared column by column; then compare all of the columns in the remaining rows that haven't been eliminated. So let's look at a scenario:

- You are processing Row A through your ETL process into a target table. Row B is the equivalent row in the target table (it has the same natural key/business key as Row A). This assumes we are first locating the equivalent row in the target table by natural key/business key of the incoming row.

There are three possible scenarios:

- Row B exists in the target table, and is equal to Row A (no change).
- Row B exists in the target table, but it is not equal to Row A (update).
- Row B does not exist in the target table (insert Row A).

Let's say you've generated two hash values, h(A) is the hash for Row A and h(B) is the hash for Row B. Now we need to use h(A) and h(B) to eliminate rows to get rid of the extra column by column comparisons. Here are the rules you need to implement to use h(A) and h(B) to eliminate extra comparisons in this compromise solution:

A. **h(A) is equal to h(B)**: according to the compromise, if h(A) = h(B) we need to compare all columns of the inbound row against the existing row since the belief is that the hash function can/will generate collisions. The idea is that h(A) may have generated the same value as h(B) even if A <> B. So we need to:

(1) Compare all columns in A and B. If A = B then perform no action.

(2) Compare all columns in A and B. If A <> B then update.

B. **h(A) is not equal to h(B)**: cryptographic hash functions guarantee that they will generate the same hash value for the exact same inputs. So we can eliminate full row comparisons if h(A) <> h(B). We know automatically that if h(A) <> h(B) then A <> B. Just perform the update.

C. **h(B) is NULL**: that is, if Row B does not exist in the target table than h(B) is NULL. This is a case where no further full-row comparisons are necessary. Just insert the row.

Now consider a slowly changing dimension (SCD) in a datamart application. Many SCDs change slowly over time (hence the name *slowly* changing dimension). This means that new rows (updates and inserts) are far less common than receiving duplicate rows during ETL. So the vast majority of your inbound data will fall under rule A(1) above. So you're still performing comparisons of all columns for the vast majority of rows in a given table just to figure out that you don't need to update them after all!

If you eliminate even 90% of the inbound rows under rule A(1) above you haven't saved much processing (you're still comparing all columns for changes for 90% of your inbound rows). You probably actually cost yourself a lot of time and efficiency since you haven't accounted for the overhead of generating hash values for 100% of the inbound rows.

The only way this compromise is more efficient is if a very large percentage of your inbound rows (much greater than 50+%) are inserts under Rule C or updates under Rule B above. If the majority of your inbound rows are duplicates of existing rows under Rule A, you gain nothing.

__The Contest__

One-way collision-free cryptographic hash functions are supposed to have negligible probability of a hash collision, or two different inputs generating the same output. Hash collisions are what cause change detection with hashes to fail.

For instance, consider the following example of an MD5 hash collision:

**DECLARE @A varbinary(8000),**

@B varbinary(8000),

@hA binary(16),

@hB binary(16);

SELECT @A = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,

@B = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70;

SELECT @hA = HASHBYTES('MD5', @A),

@hB = HASHBYTES('MD5', @B);

SELECT CASE WHEN @A = @B

THEN '@A Equals @B'

ELSE '@A Is Not Equal To @B'

END AS AB_Equal,

CASE WHEN @hA = @hB

THEN '@hA Equals @hB'

ELSE '@hA Is Not Equal To @hB'

END AS Hash_Equal;

The results are shown below:

When you run this you'll notice that the query reports the two source varbinary strings @A and @B are not equal, yet the two MD5 hashes they generate are equal. This is an example of a simple hash collision with MD5.

Now the challenge is to populate the following script with two different binary values that generate the same hash value. The output should be the same as shown above in the MD5 example.

**-- Begin script**

DECLARE @A varbinary(8000),

@B varbinary(8000),

@hA binary(20),

@hB binary(20);

-- Replace the *?* below with binary strings

SELECT @A = *?*,

@B = *?*;

SELECT @hA = HASHBYTES('SHA1', @A),

@hB = HASHBYTES('SHA1', @B);

SELECT CASE WHEN @A = @B

THEN '@A Equals @B'

ELSE '@A Is Not Equal To @B'

END AS AB_Equal,

CASE WHEN @hA = @hB

THEN '@hA Equals @hB'

ELSE '@hA Is Not Equal To @hB'

END AS Hash_Equal;

-- End script

The first person who sends me an example of two varbinary strings that generate the same SHA1 hash value will win $100 (US$) and a copy of my book Pro T-SQL 2008 Programmer's Guide.

And here are the inevitable conditions:

**No NULLs.** @A and @B in the script above cannot be set to NULL for purposes of this contest.
**8,000 bytes or less.** The T-SQL HASHBYTES function accepts varbinary(8000) values, so the values passed into it in this contest must be 8,000 bytes in length or less. The values assigned to @A and @B above must be 8,000 bytes or less in length.
**No unnecessary changes to the script.** The only change allowed to the script above are the replacement of the question marks (?) with binary strings. No other changes to the script are authorized.
**Only one person will win.** The first person who sends me a copy of the above script with two different binary values that generate an SHA-1 hash collision will win.
**Void where prohibited.** Obviously if contests like this aren't legal in your country, state, county, city, etc. then you can't take part. Petition your government to make it legal :)
**Time limits.** Entries must be received prior to midnight U.S. Eastern Standard Time on October 31, 2010.
**Decisions of the judge are final.** For purposes of this contest that would be me.
**SQL Server 2005 or 2008.** Entries must be runnable on SQL Server 2005 and SQL Server 2008 Developer Edition, and the results must be reproducible.

If a winning entry is received prior to the deadline, I'll post an update entry to the blog with the winning script and the name of the winner.

## Comments

## merrillaldrich said:

Awesome. I think, as you point out, the main problem in implementations is with using a poor hash function (checksum anyone?) AND assuming it will never have collisions.

## Gorm Braarvig, Platon said:

_64-bit hash real world sample_

I actually use MD5 on 200M rows skewed into bigint with XOR, the result is a 64-bit checksum. The 8000 byte limit is also fixed.

I hash the natural key "CS_BK" and the delta fields "CS_Delta" with calculated stored fields.

Solution has been in production for some time in a customer EDW.

The "CS_BK" is the most risky with a guesstimated probability of 25% of getting a duplicate inside the expected lifetime for the particular table.

_Implementation_

PK ...

,CS_BK AS SUBS(HASH(MD5,keys),1,8) XOR SUBS(HASH(MD5,keys),9,8)

,CS_Delta AS SUBS(HASH(MD5,vals),1,8) XOR SUBS(HASH(MD5,vals),9,8)

,...

_Performance considerations_

Special performance considerations must be taken (off topic)

_Analysis_

1) Hash delta fields, "CS_Delta"

The chance of not detecting a change is minimal, because the duplicate would have to happen within the same instance. Even if this should happen, the result will be miniscule. Risk is ignored.

2) Checksum on natural key, "CS_BK"

The chance of getting a duplicate can be calculated. Or you can use the nice table at http://en.wikipedia.org/wiki/Birthday_problem. I have not as of yet fouond a case for growing my checksum to 128 bit, but might soon. I will definitley do it if I get a 128-bit SQL Server.

Gorm Braarvig, http://twitter.com/gormb

## Michael J Swart said:

Based on my 15 minutes of web surfing, SHA-1 has some vulnerabilities, but isn't quite broken yet. From what I understand, based on the state of Sha1 research today, a winner of this contest could probably get a nice job as an academic cryptography researcher.

But your contest did get me to do a couple things. #1 is that I spent 15 minutes surfing and learning about SHA1 cryptography. #2 is that it put me squarely into the first position, i.e. the chance is so negligible that it's effectively zero.

Thanks!

## Mital kakaiya said:

Here is an another example:

--------------------------------------------

DECLARE @A varbinary(8000),

@B varbinary(8000),

@hA binary(16),

@hB binary(16);

SELECT @A = 0xa664eab88904c2ac4843410e0a63425416606c81442dd68d4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b979ebdb40e2a6e17a6235724d1df41b44673f996f1624add10293167d009b18f75a77f7930d95ceb02e8adba7ac8555ced74cadd5fc9936db19b4ad835cc67e3,

@B = 0xa664eab88904c2ac4843410e0a63425416606c01442dd68d4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b979ebdb40e2a6e17a6235724d1df41b44673f916f1624add10293167d009b18f75a77f7930d95ceb02e8adba7a48555ced74cadd5fc9936db19b4a5835cc67e3;

SELECT @hA = HASHBYTES('MD5', @A),

@hB = HASHBYTES('MD5', @B);

SELECT CASE WHEN @A = @B

THEN '@A Equals @B'

ELSE '@A Is Not Equal To @B'

END AS AB_Equal,

CASE WHEN @hA = @hB

THEN '@hA Equals @hB'

ELSE '@hA Is Not Equal To @hB'

END AS Hash_Equal;

--------------------------------------------

Kind Regards,

Kakaiya

## Agile BI said:

I had one a few jobs ago, unfortunately that was before I started saving all my code/results as I go.

I remember being quite "excited" at the time as I was the more junior guy and demonstrated that the ETL had a fundamental flaw. Was a nice moment given that the company hired in people above me to get the WH built! People who swore such a collision was impossible!!

I don't think I realised the significance at the time but every warehouse I've worked on since uses this idea and I have to be the one that breaks the news that it isn't strictly 100% reliable.

Fortunately the systems I've worked on since don't need to be 100% so that's cool - I guess it comes down to acceptable tolerance.

Sure the chances are very very unlikely but there is still a chance. If you are frequently loading billions of rows then it's funny how often those ultra unlikely events come round!

## mjswart said:

Agile BI, It's best to understand the exact numbers. The probability of collisions occurring are understood by using the birthday paradox calculation (as Michael Coles correctly did above) For SHA-1, the 2^80 is so far above the billions (or even trillions) of rows you're talking about that I'm skeptical that you were using SHA1. i.e. Loading a 800 trillion rows using sha-1 gives about a one in a billion chance of a collision

The odds of a meteor strike are far more likely. So if you're worried about data protection to that extent, build a stronger roof.

## Mike C said:

Hi all,

Thanks for the comments -- I've already gotten a few emails from various people trying to demonstrate SHA-1 collisions, but so far they all suffer from the same fundamental flaw: turns out the binary values they were hashing were exactly the same.

Agile BI - assuming you were actually using SHA-1 hashing (and not MD5 or SQL's CHECKSUM family of functions) I'd be willing to bet there was a flaw in the company's ETL code concerning concatenation of fields or some other detail. Most hash problems in ETL processes concern poor concatenation, poor choice of hash function or a bad hash function implementation.

## Brian F said:

http://cryptome.org/sha1-attacks.htm

seems like that is an example

## Comet said:

Brian F: The example http://cryptome.org/sha1-attacks.htm is an example of a collision of SHA1 reduced to 58 steps. Note that padding rules were not applied to the messages.

The challenge is to find an SHA1 collision with distinct 8KB (or less) non-null bytes using the full 80 steps with padding.

See http://www.itl.nist.gov/fipspubs/fip180-1.htm for discription of SHA1.

## Mike C said:

Hi Brian,

Comet is correct. Wang and Co.'s paper dicusses full SHA-0 hash collision and

"reduced-round" SHA-1 hash collision. The SQL Server HashBytes function uses

full SHA-1, not reduced-round variants. The reduced-round variants are only

useful in an academic context, and are never used in the real world. You might

compare them to riding a bike with one leg tied down: it might demonstrate

weaknesses in your other leg, your ability to balance in that situation, etc.,

but doesn't really give a good indication of how skilled your riding is when

fully using both legs.

Every now and then someone comes up with an interesting theoretical attack, like these guys: http://eurocrypt2009rump.cr.yp.to/837a0a8086fa6ca714249409ddfae43d.pdf. Occasionally these guys withdraw their papers because they figure out they're wrong: http://eprint.iacr.org/2009/259 :)

There's no restriction on the resources you can use to find a collision for this contest. With Wang's papers in hand you have collisions up to the 58th round, which means you only have 22 rounds to go! :)

## Josef Richberg said:

I have done quite a bit of non-academic work with cryptographic hash functions and find using SHA-1 OR AES is quite sufficient for computing hash keys on incoming binary data (.pdfs, word docs, images, excel docs, etc). I would like to see what type of documents the two examples above were that generated a hash. If the binaries were just random values, then I am not sure it works in a real world ETL. If the binaries were actual word docs or .pdfs, then it is of interest, but that boils down to the 1 in 2^80 chance. If that is to close for comfort you can always use the AES which has a 1 in 2^128 chance of collision. With this said, finding a collision for SHA-1 seems more an academic exercise than a real world exercise.