When I started blogging here on sqlblog.com, I intended to write about stuff like T-SQL, performance, and such; but also about data modeling and database design. In reality, the latter has hardly happened so far – but I will try to change that in the future. Starting off with this post, in which I will pose (and attempt to answer) the rather philosophical question of the title: is data modeling an art or a science?
Before I can answer the question, or at least tell you how I think about the subject, we need to get the terms straight. So that if we disagree, we’ll at least disagree over the same things instead of actually agreeing but not noticing. In the next two paragraphs, I’ll try to define what data modeling is and how it relates to database design, and what sets art apart from science in our jobs.
Data modeling and database design
When you have to create a database, you will ideally go through two stages before you start to type your first CREATE TABLE statement. The first stage is data modeling. In this stage, the information that has to be stored is inventoried and the structure of that information is determined. This results in a logical data model. The logical data model should focus on correctness and completeness; it should be completely implementation agnostic.
The second stage is a transformation of the logical data model to a database design. This stage usually starts with a mechanical conversion from the logical data model to a first draft of the implementation model (for instance, if the data model is represented as an ERM diagram but the data has to be stored in an RDBMS, all entity and all many-to-many relationships become tables and all attributes and 1-to-many relationships become columns). After that, optimization starts. Some of the optimizations will not affect the layout of tables and columns (examples are choosing indexes, or implementing partitioning), but some other optimizations will do just that (such as adding a surrogate key, denormalizing tables, or building indexed views to pre-aggregate some data). The transformation from logical data model to database design should focus on performance for an implementation on a specific platform. As long as care is taken that none of the changes made during this phase affect the actual meaning of the model, the resultant database design will be just as correct and complete (or incorrect and incomplete) as the logical data model that the database design is based on.
Many people prefer to take a shortcut by combining these two stages. They produce a data model that is already geared towards implementation in a specific database. For instance by adding surrogate keys right into the first draft of the data model, because they already know (or think) that they will eventually be added anyway. I consider this to be bad practice for the following reasons:
- It increases the chance of errors. When you have to focus on correctness and performance at the same time, you can more easily lose track.
- It blurs the line. If a part of a data model can never be implemented at acceptable performance, an explicit decision has to be made (and hopefully documented) to either accept crappy performance or change the data model. If you model for performance, chances are you’ll choose a better performing alternative straight away and never document properly why you made a small digression from the original requirements.
- It might have negative impact on future performance. The next release of your DMBS, or maybe even the next service pack, may cause today’s performance winner to be tomorrows performance drainer. By separating the logical data model from the actual database design, you make it very easy to periodically review the choices you made for performance and assess whether they are still valid.
- It reduces portability and maintainability. If, one day, your boss informs you that you need to port an existing application to another RDBMS, you’ll bless the day you decided to separate the logical data model from the physical database design. Because you now only need to pull out the (still completely correct) logical data model, transform again, but this time apply optimization tricks for the new RDBMS. And also, as requirements change, it is (in my experience) easier to identify required changes in an implementation-independent logical data model and then move the changes over to the physical design, than to do that all at once if only the design is available.
- It may lead to improper choices. More often that I’d like, I have seen good modelers fall victim to bad habits. Such as, for instance, adding a surrogate key to every table (or entity or whatever) in the data model. But just because surrogate keys are often good for performance (on SQL Server that is – I wouldn’t know about other DBMS’s) doesn’t mean they should always be used. And the next step (that I’ve witnessed too!) is forgetting to identify the real key because there already is a key (albeit a surrogate key).
Art and science
For the sake of this discussion, “art” (work created by an artist) is the result of some creative process, usually completely new and unique in some way. Most artists apply learned skills, though not always in the regular way. Artists usually need some kind of inspiration. There is no way to say whether a work of art is “good” or “bad”, as that is often in the eye of the beholder – and even if all beholders agree that the work sucks, you still can’t pinpoint what exactly the artist has done wrong. Examples of artists include painters, composers, architects, etc. But some people not usually associated with art also fit the above definition, such as politicians, blog authors, or scientists (when breaking new grounds, such as Codd did when he invented the relational model). Or a chef in a fancy restaurant who is experimenting with ingredients and cooking processes to find great new recipes to include on the menu.
“Science” does NOT refer to the work of scientists, but to work carried out by professionals, for which not creativity but predictability is the first criterion. Faced with the same task, a professional should consistently arrive at correct results. That doesn’t imply that he or she always will get correct results, but if he or she doesn’t, then you can be sure that an error is made and that careful examination of the steps taken will reveal exactly what that error was. Examples of professionals include bakers, masons, pilots, etc. All people that you trust to deliver work of a consistent quality – you want your bread to taste the same each day, you want to be sure your home won’t collapse, and you expect to arrive safely whenever you embark a plane. And a regular restaurant cook is also supposed to cook the new meal the chef put on the menu exactly as the chef instructed.
Data modeling: art or science?
Now that all the terms have been defined, it’s time to take a look at the question I started this blog post with – is data modeling an art or a science? And should it be?
To start with the latter, I think it should be a science. If a customer pays a hefty sum of money to a professional data modeler to deliver a data model, then, assuming the customer did everything one can reasonably expect1 to answer questions from the data modeler, the customer can expect the data model to be correct2.
1 I consider it reasonable to expect that the customer ensures that all relevant questions asked by the data modeler are answered, providing the data modeler asks these questions in a language and a jargon familiar to the customer and/or the employees he interviews. I do not consider it reasonable to expect the customer (or his employees) to learn language, jargon, or diagramming style used by data modelers.
2 A data model is correct if it allows any data collection that is allowed according to the business rules, and disallows any data collection that would violate any business rule.
Unfortunately, my ideas of what data modeling should be like appear not to be on par with the current state of reality. One of the key factors I named for “science” is predictability. And to have a predictable outcome, a process must have a solid specification. As in, “if situation X arises, you need to ask the customer question Y; in case of answer Z, add this and remove that in the data model”. Unfortunately, such exactly specified process steps are absent in most (and in all commonly used!) data modeling methods. However, barring those rules, you have to rely on the inspiration of the data modeler – will he or she realize that question Y has to be asked? And if answer Z is given, will the modeler realize that this has to be added and that has to be removed? And if that doesn’t happen, then who’s to blame? The modeler, for doing everything the (incomplete) rules that do exist prescribe, but lacking the inspiration to see what was required here? Or should we blame ourselves, our industry, for allowing ourselves to have data modeling as art for several decades already, and still accepting this as “the way it is”?
Many moons ago, when I was a youngster that had just landed a job as a PL/I programmer, I was sent to a course for Jackson Structured Programming. This is a method to build programs that process one or more inputs and produce one or more outputs. Though it can be used for interactive programs as well, it’s main strength is for batch programs, accessing sequential files. The course was great – though the students would not always arrive at the exact same design, each design would definitely be either correct, or incorrect. All correct designs would yield the same end result when executed against the same data. And for all incorrect designs, the teacher was able to pinpoint where in the process an error was made. For me, this course changed programming from art into science.
A few years later, I was sent to a data modeling course. Most of the course focused on how to represent data models (some variation of ERM was used). We were taught how to represent the data model, but not how to find it. At the end, we were given a case study and asked to make a data model, which we would then present to the class. When we were done and the first model was presented, I noticed some severe differences from my model – differences that would result in different combinations of data being allowed or rejected. So when the teacher made some minor adjustments and then said that this was a good model, I expected to get a low note for my work. Then the second student had model that differed from both the first and my model – and again, the teacher mostly agreed to the choices made. This caused me to regain some of my confidence – and indeed, when it was my turn to present my, once again very different, model, I too was told that this was a very good one. So we were left with three models, all very different, and according to the instructor, they were all “correct” – and yet, data that would be allowed in the one would be rejected by the other. So this course taught me that data modeling was not science, but pure art.
This was over a decade ago. But has anything changed in between? Well, maybe it has – but if so, it must have been when I was not paying attention, for I still do not see any mainstream data modeling method that does provide the modeler with a clear set of instructions on what to do in every case, or the auditor with a similar set of instructions to check whether the modeler did a great job, or screwed up.
Who’s to blame?
Another difference between art and science is the assignment of blame. If you buy a bread, have a house built, or embark on a plane, then you know who to blame if the bread is sour, if the house collapses, or if the plane lands on the wrong airport. But if you ask a painter to create a painting for your living and you don’t like the result, you can not tell him that he screwed up – because beauty is truly a matter of taste.
Have you ever been to a shop where all colors paint can be made by combining adequate amounts of a few base colors? Suppose you go to such a shop with a small sample of dyed wood, asking them to mix you the exact same color. The shopkeeper rummages through some catalogs, compares some samples, and then scribbles on a note: “2.98 liters white (#129683), 0.15 liters of cyan (#867324), and 0.05 liters of red (#533010)”. He then tells you that you have to sign the note before he can proceed to mix the paint. So you sign. And then, once the paint has been mixed, you see that it’s the wrong color – and the shop keepers then waves the signed slip of paper, telling you it’s “exactly according to the specification you signed off”, so you can’t get your money back. Silly, right?
And yet, in almost every software project, there will be a moment when a data model is presented to the customer, usually in the form of an ERM diagram or a UML class diagram, and the customer is required to sign off for the project to continue. This is, with all respect, the same utter madness as the paint example. Let’s not forget that the customer is probably a specialist in his trade, be it banking, insurance, or selling ice cream, but not in reading ERM or UML diagrams. How is he supposed to check whether the diagram is an accurate representation of his business needs and business rules?
The reason why data modelers require the customer to sign off the data model is, because they know that data modeling is not science but art. They know that the methods they use can’t guarantee correct results, even on correct inputs. So they require a signature on the data model, so that later, when nasty stuff starts hitting the fan, they can wave the signature in the customer’s face, telling him that he himself signed for the implemented model.
In the paint shop, I’m sure that nobody would agree to sign the slip with the paint serial numbers. I wouldn’t! I would agree, though, to place my signature on the sample I brought in, as that is a specification I can understand. Translating that to paint numbers and quantities is supposed to be the shopkeepers’ job, so let him take responsibility for that.
So, I guess the real question is … why do customers still accept it when they are forced to sign for a model they are unable to understand? Why don’t they say that they will gladly sign for all the requirements they gave, and for all the answers they provided to questions that were asked in a language they understand, but that they insist on the data modeler taking responsibility for his part of the job?
Maybe the true art of data modeling is that data modelers are still able to get away with it…