Spoiler: TPH is an evil trap
In the last installment I pleaded for the equal treatment of the database schema and the object model when implementing an application with a database, on the grounds that the failure of either means the failure of the whole system. I think it follows that a decent application ought to have both a good quality database schema and a fabulous code base. If the database is an embarrassment on relational grounds, it compromises the whole, just as it does if the app code is an embarrassment on OO grounds.
(I am working from the assumption that we do try to create good quality software; there's the whole "sales driven design" notion that says we can't ever create anything of value because it's too costly; I reject that notion as ... well, boring. There would not be much to write here if we simply give up trying to make anything good.)
So, for the sake of argument, let's assume we've picked an O/R mapper to help create the app by automating much of the plumbing of repetitive SQL queries and moving data to and fro. I am going to invoke NHibernate, because it seems decent, and mature, and so on. I think Entity Framework has the potential to turn into something great if built out correctly in upcoming versions, yet so far it seems over-inflated with a lot of marketing noise while it catches up to other solutions. Then again, I am not really an expert with ORM, and others could do a better job of comparing them. Assume an appropriate level of abstraction is maintained between the database and the application, either by views, functions, and stored procs, or by a nice, clean data access layer, which you can build with a mapper if you implement it correctly.
The most trivial bits of O/R mapping aren't that interesting - a simple atomic object corresponds to a simple row in a simple table using a surrogate key. A collection like List&lt;T&gt; in the app is a collection of such objects representing a subset of the rows from a table. "Product Object" implies "select ProductID, Name, Price from Products where ProductID = n." That's all quite basic, and something one would probably create manually using traditional coding techniques. The place where things get interesting and hairy is the more complicated cases where the object model and the relational model don't match up so neatly. I'll cut straight to the chase: polymorphism is where things get ugly, and that's where we should start to be careful. And if you look up polymorphism in relation to popular ORMs, an issue rears its ugly head immediately: the TPH, or Table Per Hierarchy, persistence model for polymorphism. Consider these descriptions:
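To make the trivial case concrete, here is a minimal sketch in Python against SQLite of what that hand-rolled row-to-object mapping looks like. The `Products` table and `load_product` helper are illustrative inventions, not any particular mapper's API:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    name: str
    price: float

def load_product(conn, product_id):
    # The trivial case: one simple row maps to one simple object,
    # keyed by a surrogate key.
    row = conn.execute(
        "SELECT ProductID, Name, Price FROM Products WHERE ProductID = ?",
        (product_id,),
    ).fetchone()
    return Product(*row) if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Name TEXT, Price REAL)"
)
conn.execute("INSERT INTO Products VALUES (1, 'Widget', 9.99)")
print(load_product(conn, 1))  # Product(product_id=1, name='Widget', price=9.99)
```

This is the kind of plumbing a mapper automates thousands of times over; nothing controversial yet.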
NHibernate's Inheritance Mapping docs
Oren Eini (Ayende Rahien)'s very concise blog description
Alex James's excellent version pertaining to Entity Framework
I don't have anything against the smart, hard-working folks who put this stuff together, nor am I anti-ORM in any way. Still, amazingly, all these documents and blogs about both NHibernate and Entity Framework give roughly equal billing to three methods for representing polymorphic objects in a relational schema -- even though one of them, TPH (which, horribly, seems always to come first in every such list), violates the most basic best practices for relational database design. It's not normalized even in a trivial sense, it makes it impossible to enforce integrity on the data, and, what's most "awesome": it is virtually guaranteed to perform badly at scale for any non-trivial set of data.
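A sketch of what TPH looks like at the schema level makes the integrity problem obvious. This is SQLite with hypothetical column names (a stripped-down echo of the accounting example later in this post): every subtype-specific column must be nullable, so the database has no way to state which combinations of values are legal for which discriminator:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# TPH: every subtype squeezed into one table; the Discriminator column
# says which of the other columns are "supposed" to be populated.
conn.execute("""
    CREATE TABLE Transactions (
        TransactionID INTEGER PRIMARY KEY,
        Discriminator TEXT NOT NULL,   -- 'Charge', 'Payment', 'Check', ...
        Amount        REAL,            -- everything else must be nullable
        CheckNumber   TEXT,            -- only meaningful for checks
        InvoiceNumber TEXT             -- only meaningful for invoices
    )
""")
# Nothing stops a 'Payment' row from carrying a CheckNumber, or a
# 'Check' row from omitting one -- the schema cannot express the rule,
# so this nonsense row is happily accepted:
conn.execute(
    "INSERT INTO Transactions VALUES (1, 'Payment', 50.0, '1234', NULL)"
)
```

Only application code (or a thicket of CHECK constraints per discriminator value) can police this, and at that point the schema is documenting nothing.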
Luckily, there is a workable alternative in most of these tools: Table Per Type. What's missing from the docs and recommendations around TPH is some big, bold print that says,
"Never do TPH, because it's a horrible nightmare in the relational world, a violation of 30 years of best practice in database design, and probably makes a DBA somewhere cry inside. Always do TPT. TPH is presented here only as a twisted curiosity, and is to be avoided."
Lack of that warning about TPH means people might be tempted to actually use it, which is unfortunate.
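For contrast, here is a minimal TPT sketch of the same kind of hierarchy, again in SQLite with hypothetical names: shared columns live in a base table, each subtype gets its own narrow table keyed to the base, and the subtype-specific columns can finally be declared NOT NULL and enforced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    -- TPT: shared columns in a base table, one narrow table per subtype.
    CREATE TABLE Transactions (
        TransactionID INTEGER PRIMARY KEY,
        Amount        REAL NOT NULL
    );
    CREATE TABLE Checks (
        TransactionID INTEGER PRIMARY KEY
                      REFERENCES Transactions(TransactionID),
        CheckNumber   TEXT NOT NULL   -- required, because here it always applies
    );
    CREATE TABLE Invoices (
        TransactionID INTEGER PRIMARY KEY
                      REFERENCES Transactions(TransactionID),
        InvoiceNumber TEXT NOT NULL
    );
""")
```

An orphan Checks row, or a check with no check number, is now rejected by the database itself rather than by whatever discipline the application happens to have.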
I imagine all DBAs will hate TPH. Some devs, I am certain, will complain that the DBAs are being obstinate or unreasonable, but here's the thing: TPH is a bad idea, and the DBAs know it. If the shoe were on the other foot, so to speak -- if DBAs were demanding that developers violate fundamental programming principles in the design of their code, without even really understanding those principles -- the developers might be a bit miffed as well, and with good reason. The sad thing is that blunders like TPH can sour DBAs on the entire notion of ORM, simply because ORM tools can seem to encourage bad database design. It really ought to be the other way around. But let's set the cultural argument aside and look at technical reasons why this method is to be avoided.
I worked with an ISV-provided piece of software for years that, by coincidence, had important, busy tables designed on the TPH principle. Ironically it didn't use anything as elaborate as an O/R mapper; the developers had just adopted the same notion for polymorphism. The application had an accounting function, among other things. Each financial transaction in the accounting part of the system, of every type (charges, payments, checks, invoices, etc.), was stored in one massive Transactions table. The table was 120 columns wide, and practically every column was nullable and might or might not have a value depending on what type of transaction was indicated in - you guessed it - the "discriminator" column. There were 16 types of transactions, if I recall. This massive table also stored both posted and un-posted transactions (hey, they're all "transactions," right?). It also contained the header records for batches of transactions that were entered at the same time. In order to "post" un-posted transactions, an elaborate process would change the discriminator/record type and set various dates and flags in the table, through a ... erm, "complex" ... procedure. A massive amount of this huge, largely empty table (nulls take space, even without values) had to be kept in cache on the SQL Server all the time. It was basically filling the RAM on the server with blank space, and the odd value here or there at random. Ironically, this design provided no advantage whatsoever. It wasn't as if some advantage were being traded off against the disadvantages. It was simply, purely horrible.
The application did not perform well, at all, at scale, and was just a constant source of expense and pain for us. Not because of some purist ideal of normalization, but because all the financial data was mixed into a great disorganized pile. The frustrating thing about it was that had it been based on a reasonable schema, there would have been no performance issue at all, and it would easily have worked for hundreds or a few thousands of end users on fairly modest hardware.
Of the docs and blog posts I have been able to find on this topic, most give only a cursory overview, along the lines of "you might want to do this, or that, it's not that important." Alex James does more in the way of explaining how the options work and why to pick one. Interestingly, there's a table on his blog page ranking them by different criteria, and he marks TPT, the relational-friendly option, as the "winner" in every category except performance. I had the privilege of attending an excellent session he gave at the PASS Summit, and he is a very sharp guy with an understanding of the database side of this issue. Yet I do have to wonder about the "performance" argument, as it doesn't match my experience at all. Maybe I am jumping to conclusions, but I wonder if it's the old myth about "avoid joins to enhance performance" coming back around.
I suspect this: for a trivial case, where you have five or six object types that are quite similar, then stuffing them into one table might avoid some joins - as long as the cost of the interleaving of rows of different types, which reduces the efficacy of indexes and takes more memory and slows many types of queries down, doesn't cancel the intended optimization. However for real world cases with large sets of data, design-by-join-avoidance isn't really a good strategy. Better performance is gained, in most cases, by the partitioning of data into smaller, more highly sorted and organized sets. Yes, the server has to perform joins, but the cost of joins is much less than the cost of brute-force searching through a high volume of interleaved, mixed, sparse rows of different "types" in one massive table.
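The join that TPT requires is cheap precisely because each table involved is narrow, fully populated, and keyed the same way. A sketch in SQLite, using hypothetical Transactions and Checks tables: reassembling one subtype object costs a single primary-key-to-primary-key join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Transactions (
        TransactionID INTEGER PRIMARY KEY,
        Amount        REAL NOT NULL
    );
    CREATE TABLE Checks (
        TransactionID INTEGER PRIMARY KEY
                      REFERENCES Transactions(TransactionID),
        CheckNumber   TEXT NOT NULL
    );
    INSERT INTO Transactions VALUES (1, 250.0);
    INSERT INTO Checks VALUES (1, '1042');
""")
# Yes, loading a Check costs a join -- but both sides are narrow tables
# containing only rows of the right type, indexed on the same key, so
# each lookup is two cheap seeks rather than a crawl through a wide,
# sparse, interleaved heap.
row = conn.execute("""
    SELECT t.TransactionID, t.Amount, c.CheckNumber
    FROM Transactions t
    JOIN Checks c ON c.TransactionID = t.TransactionID
    WHERE t.TransactionID = 1
""").fetchone()
print(row)  # (1, 250.0, '1042')
```

The point isn't that joins are free; it's that two seeks into small, dense, well-sorted sets routinely beat one scan through a huge, mostly-empty one.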
Perhaps a clearer analogy: suppose we had one uber-table, called "Object." Everything inherits from Object, so why not one massive, ultimate Table Per Hierarchy with Object as the root? We just put in a discriminator that indicates what type of object each row represents, and a nullable column for every property of every object. Simple! The issue is that everything, in this case literally everything, is mixed together. Should perform swimmingly, because there are no joins!
But wait. There are other issues over and above joins. How does the server locate data in this table? How much cache would be required? What does an index look like? Clearly the server would perform better if the data were partitioned sensibly, into smaller groups, so that it could find related information. The answer, of course, is tables. Having multiple tables is not just convenient; it's a performance optimization. In fact, many performance issues can be solved by adding tables, not removing them - most of the time, if unrelated rows can be partitioned into separate sets (tables), it becomes easier and faster for the server to locate them.
So if performance is the one argument for TPH, the one that Alex says "trumps these other concerns," then I have to say I'm not quite buying it. In fact, I can't see why anyone would ever want to do TPH for anything other than a small toy of a system. In fact, can we turn that thing off in any of these O/R mappers? Where's the OFF button?
Back to the larger point, which is, "what are the major design issues for connecting apps to data?" I think the points I want to make are: a. the database schema still deserves careful attention and design, even in 2010, and b. the use of modern methodologies (agile, ORM, etc.) can be aligned with best practices but you have to be intentional about it. You have to be able to say things like, "TPH sounds neat-o, but it violates normalization fundamentals. What's up with that? That seems like a bad idea, even according to my olde-time gray-haired DBA over here. Why do that? What are my other options? Which ones don't make the database suck?" The end goal should be "and" - great database AND great application, no compromises.