In college, as part of my philosophy degree, I took a course
on metaphysics. I can still vividly remember the first class. The professor
presented us with a simple enough scenario:
Consider a wooden boat, whose construction has just finished.
Now fast-forward several months. The boat has been well used, and one of the
planks that make up its hull has gotten worn down and needs to be replaced.
After the plank is replaced, is it still the same boat?
The class almost unanimously agreed that it was, indeed, the
same boat. Replacement of a single plank certainly wouldn't change the boat as
a whole, would it? But this wasn't the end of the professor's line of questioning.
What if, he asked, over time every
plank was eventually replaced? Would the boat in this state--being composed of
an entirely different set of planks than it originally was--still be the same
boat as the one we had considered to begin with?
This question led to quite a bit more debate than the
first, but the end result was still more or less unanimous, albeit with a
different conclusion: We agreed that something didn't quite feel right about
saying that it was the same boat, if it was composed of entirely different
materials than the original boat. So it probably was not the same boat anymore.
But this conclusion raised one final question, the real question: If the boat wasn't the
same boat after having all of its planks replaced, was there some instant
within the period during which the planks were replaced that it ceased being
the same boat? Did the change happen right as the final plank was replaced? Or
perhaps, was the boat suddenly no longer the same when exactly half of its
planks were swapped out? Or was the change gradual? And can a thing gradually
cease to be the same as it was?
The class again fell into debate, and eventually we came to
the conclusion that we had been wrong to begin with; the boat wasn't the same
boat after the first plank was swapped out. But that was already too late,
because it wasn't even the same boat before
that replacement was made. It wasn't the same boat even the moment after it was
created; the microscopic state of the wood and nails had already changed
somewhat due to airflow and moisture.
The real "aha" moment was the realization that there
is a difference between identity and state. Up to and including the period after
all of its original planks were swapped out, we could still point to the boat and
call it by its original name--even while acknowledging that it wasn't the same
boat, physically, that it once was. Its state had changed, but it was still
identified the same way.
People who regularly work with databases are already quite
familiar with the differences between identity and state. The idea of an
immutable primary key--considered a best practice by many database designers--is
a perfect example of this. We want to be able to identify our entity instances
even as their state changes. We're also lucky enough to generally track a very
limited number of attributes. Unlike in the boat analogy, where any number of
possible factors down to the submicroscopic level can be argued to change the
state of the boat, in our databases we have a fixed set of criteria. It's easy
to get enough information to confidently say "this instance is no longer
in the same state as it was previously."
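To make the distinction concrete, here is a minimal Python sketch. The `BoatRow` type, its fields, and the boat's name are all hypothetical, invented purely for illustration. The key plays the role of identity, and the full set of attribute values plays the role of state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BoatRow:
    """Hypothetical row: one immutable key plus a fixed set of attributes."""
    boat_id: int          # primary key: the identity, which never changes
    name: str             # every other attribute is part of the state
    planks_replaced: int


def same_identity(a: BoatRow, b: BoatRow) -> bool:
    # Two versions describe the same instance if their keys match.
    return a.boat_id == b.boat_id


def same_state(a: BoatRow, b: BoatRow) -> bool:
    # Two versions are in the same state only if every attribute matches.
    return a == b


launched = BoatRow(boat_id=1, name="Sea Breeze", planks_replaced=0)
repaired = replace(launched, planks_replaced=1)  # new version, same key

print(same_identity(launched, repaired))  # True
print(same_state(launched, repaired))     # False
```

Because the set of attributes is fixed, comparing two versions of a row is a finite, mechanical check, which is exactly what the boat never offered us.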
And that brings me around to the actual topic of this post:
programmatic concurrency control. Every time I give my talk on designing highly
concurrent database applications--in which I discuss pessimistic, optimistic,
and multivalue concurrency schemes--I seem to get the same question: "What
about updates to different columns? In
that situation do we have a collision?"
Neither in the talk, nor in Expert
SQL Server 2005 Development (the talk is based on material from Chapter 8
of the book) do I address this topic. It's simply not something I thought was
an issue with these or other concurrency schemes, and so I didn't cover it. But
the proof is in what readers and attendees actually want to learn, and I keep
getting this question--so apparently it is
an issue for some people.
For the purpose of this post, I'll start with a summary and
work backward from there: The general answer I've given to this question, and
will continue to give, is that updates of the same row, but to different
columns, should be treated as a collision
in almost every case.
To begin thinking about why this must be true, we should
start with what actually defines an entity/type instance in a database (or
elsewhere). An instance is really nothing more than a specific collection of values corresponding to the attributes
defined by the entity. Note the key word, "specific." Any other
collection of values is a different instance, or at least a change to the
instance. Each instance happens to be uniquely identified based on a certain
subset of these attributes (i.e., its primary key), but the instance cannot be
defined based solely on this key. As an example, a car's VIN uniquely
identifies the car, but doesn't tell you what color it is or whether it has a
crack in the rear windshield.
And that brings us around to concurrency control itself.
What is the purpose of concurrency control solutions, other than to serialize changes made to any given
instance? The point is to reduce
concurrency on a given instance, not to increase it. This helps us avoid
logical traps that otherwise might be extremely difficult to detect.
Take, for example, a table of addresses used by a credit
card firm. The customers of the firm send in address update cards when they
move or need to make changes to their address on file, and these cards wind up
in the hands of data entry clerks whose job it is to input the changes. Today
is a special day, because one customer wasn't sure if he'd already sent a card
in, so he sent two. Alas, these cards wound up in the hands of two different
data entry clerks simultaneously. Let's watch what happens if updates to
different columns are not considered to be a collision.
The customer's initial address on file is:
235 Main Street, Springland, OR 97999
As it happens, 235 Main Street is actually an apartment
complex, and the customer hasn't been getting his bills because the postal
carrier doesn't know his name and no apartment number is listed on the mail.
The customer's update cards both contain requests that his apartment number--Apartment
2--get added to his address on file.
The schema for this table includes an ApartmentNumber
column, and the first data entry clerk uses it, setting its value to "2".
The second clerk, alas, is new to the job and doesn't notice the field on the
data entry user interface. So he updates the AddressLine1 column, setting its
value to "235 Main Street, Apartment 2".
No collision is detected, but the customer still doesn't get
his bills--his address is now rendered by the mailing system as:
235 Main Street, Apartment 2 2, Springland, OR 97999
And now the identity thief who happens to live in Apartment
22 is getting the bills. Oops!
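The scenario can be replayed in a few lines. This is only a sketch under several assumptions: Python's built-in sqlite3 module stands in for the real database, the `Addresses` table name and every column other than AddressLine1 and ApartmentNumber are hypothetical, and the way the mailing system joins the fields is my guess. Each update is guarded only by the old value of the single column being changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Addresses ("
    " AddressId INTEGER PRIMARY KEY,"
    " AddressLine1 TEXT NOT NULL,"
    " ApartmentNumber TEXT NOT NULL DEFAULT '',"
    " City TEXT NOT NULL, State TEXT NOT NULL, PostalCode TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO Addresses VALUES "
    "(1, '235 Main Street', '', 'Springland', 'OR', '97999')"
)

UPDATABLE = {"AddressLine1", "ApartmentNumber"}


def update_column_if_unchanged(column, old_value, new_value):
    """Column-level check: only the column being written is compared."""
    if column not in UPDATABLE:  # guard the interpolated column name
        raise ValueError(column)
    cur = conn.execute(
        f"UPDATE Addresses SET {column} = ? "
        f"WHERE AddressId = 1 AND {column} = ?",
        (new_value, old_value),
    )
    return cur.rowcount == 1  # True means "no collision detected"


# Both clerks looked at the original row before either one saved.
clerk1_ok = update_column_if_unchanged("ApartmentNumber", "", "2")
clerk2_ok = update_column_if_unchanged(
    "AddressLine1", "235 Main Street", "235 Main Street, Apartment 2"
)

line1, apt, city, state, postal = conn.execute(
    "SELECT AddressLine1, ApartmentNumber, City, State, PostalCode "
    "FROM Addresses WHERE AddressId = 1"
).fetchone()

# Assumed rendering: the mailing system appends the apartment number.
rendered = f"{line1} {apt}, {city}, {state} {postal}"
print(clerk1_ok, clerk2_ok)
print(rendered)
```

Both updates report success, because neither clerk touched the column the other one checked, and the row ends up in a state that neither of them intended.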
While this is certainly a contrived scenario, it should
serve to illustrate the difficulty of coming up with a proper way to avoid
collisions without locking an entire row (or instance). Note also that a
programmatic concurrency control scheme's primary job is to block any
possibility that such collisions can happen, but its other purpose is to give
the user enough information to help avoid problems to begin with. In this case,
had either of the data entry clerks seen the other's update, the invalid data
would never have hit the system. A detected collision must do more than just
keep bad data out--it also should return information about the nature of the
collision, in order to help the user to better do his job.
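Here is one way that might look, again as a sketch rather than a definitive implementation. It assumes sqlite3 in place of the real database, a hypothetical `Addresses` table, and a hypothetical integer `RowVersion` column that is bumped on every update and stands in for whatever row-versioning mechanism the database provides. `CollisionError`, `read_address`, and `update_address` are names I made up for this example.

```python
import sqlite3


class CollisionError(Exception):
    """The row changed since it was read; carries the row's current state."""

    def __init__(self, current_row):
        super().__init__(f"row changed since it was read: {current_row}")
        self.current_row = current_row


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE Addresses ("
    " AddressId INTEGER PRIMARY KEY,"
    " AddressLine1 TEXT NOT NULL,"
    " ApartmentNumber TEXT NOT NULL DEFAULT '',"
    " RowVersion INTEGER NOT NULL DEFAULT 1)"
)
conn.execute(
    "INSERT INTO Addresses (AddressId, AddressLine1) "
    "VALUES (1, '235 Main Street')"
)

UPDATABLE = {"AddressLine1", "ApartmentNumber"}


def read_address(address_id):
    row = conn.execute(
        "SELECT * FROM Addresses WHERE AddressId = ?", (address_id,)
    ).fetchone()
    return dict(row) if row is not None else None


def update_address(address_id, version_read, column, new_value):
    """Row-level check: ANY change to the row since the read is a collision."""
    if column not in UPDATABLE:  # guard the interpolated column name
        raise ValueError(column)
    cur = conn.execute(
        f"UPDATE Addresses SET {column} = ?, RowVersion = RowVersion + 1 "
        "WHERE AddressId = ? AND RowVersion = ?",
        (new_value, address_id, version_read),
    )
    if cur.rowcount == 0:
        # Hand the caller the row as it stands now, so the user can see
        # what the other update did before deciding how to proceed.
        raise CollisionError(read_address(address_id))


clerk1 = read_address(1)
clerk2 = read_address(1)

update_address(1, clerk1["RowVersion"], "ApartmentNumber", "2")
try:
    update_address(
        1, clerk2["RowVersion"], "AddressLine1", "235 Main Street, Apartment 2"
    )
except CollisionError as err:
    seen_by_clerk2 = err.current_row
    print("Collision; row is now:", seen_by_clerk2)
```

The second clerk's update is rejected even though it targets a different column, and the error shows him that the apartment number is already on file.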
Concurrency control at the instance level may not be right
for every application, but I've yet to see a great example of where
implementation of a column-based scheme is truly the right choice from a
cost-benefit perspective. A column-based scheme will be much more complex to
implement, may leave logical holes as shown here, and in the vast majority of
cases will not improve scalability--really, the only possible argument in its
favor--by enough to justify that complexity.
Like the boat, an instance in a database changes with time. And
like the boat, even the smallest change to an instance effects a new version,
regardless of whether we can still identify it as the same.