Earlier today I posted the following question on Twitter:
Foreign keys in a data warehouse. Yes or no? This discussion is looming at work, i know my position but am interested in what others think.
Specifically, I wanted to know if people were in favour of creating foreign keys from their fact to to their dimension tables or not.
To say it prompted a good response is a bit of an understatement, 38 responses so far and counting. Here are some of those responses:
It certainly seems to be an emotive subject and its clear (to me) that there’s no correct answer, just lots of opinions. That’s a good thing. The majority of responders appeared to be of the opinion that a data warehouse should contain foreign keys and that is my position too. In this blog post I want to outline why I believe that one should create foreign keys from a fact table to its dimension tables:
Of course, this is the main reason why foreign keys exist – to protect the integrity of your data. I see no reason not to use them for this purpose in a data warehouse. The main argument that I see going against is that with a sufficiently robust ETL solution it shouldn’t be necessary. That is true but I would counter with “how do you know that your ETL solution is suitably robust?” I don’t think its possible to anticipate every eventuality that may arise and for that reason I like the safety net that foreign keys provide. I liked Devin Knight’s response here, foreign keys breed confidence.
Communicate business logic
When I join a project that has a database in place the first thing I do is try and understand the data model – to do that I go and look at the foreign keys in that database. Understanding the dependencies between entities is crucial in any data model and the best means of communicating those is via foreign keys. If I encounter a database that is bereft of foreign keys then my heart sinks a little.
Foreign keys can, in some circumstances, be beneficial in improving query performance. Take a read of A Check and Foreign Key Constraint Improves Query Performance by Sarvesh Singh or Do Foreign Key Constraints Help Performance? by Grant Fritchey.
I am a big fan of generating ETL code where possible and foreign keys can be invaluable when doing so.
As I said there is no correct answer here so if you have any opinions, either agreeing or disagreeing, I look forward to reading your thoughts in the comments below.
Mark Stacey’s comment prompted an interesting digression into talking about surrogate keys for denoting unknown members and this is something I have strong opinions on too:
In short, I don’t like the practice of using “–1” as the surrogate key for an unknown member. My reasoning is simple, I don’t like giving meaning to something that is supposed to be meaningless. How then should we indicate which is the unknown member? I propose a single-row table that includes a column for each dimension table, each with a foreign key to the unknown member in the respective dimension table.
Moreover I don’t like the practice of starting surrogate key counters from 1; the first value available for the integer datatype in SQL Server is –2147483648 so why not start from that?
I discuss both of these issues in much more depth at Considering surrogate keys for Unknown members.
Again if you have any thoughts on these subjects please put them in the comments. If nothing else I find it both fun and educational to debate this stuff.
Update, Chris Adkin posted a comment below that contained a link to Microsoft's own guidance on building datawarehouses where it is stated:
"Many physical designs for data warehouses follow the star schema but do not completely specify the relationships between the fact and dimension tables, as mentioned earlier for foreign key constraints, for instance. Without the foreign key constraints explicitly specified, SQL Server must depend on heuristics to detect star schema query patterns."
Chris' take on this:
"So, assuming we are talking about a Kimball DW, there is most definitely value in using foreign key contstraints as this provides a fail safe for the heuristics getting it wrong."