In my last post in this series, I talked about using user-defined data types (alias types). Today I wanted to discuss many of the ways in which people subject their date and time columns to very inappropriate query methodologies.
It's very easy to say, "Hey, don't do the wrong thing!" Not so easy to actually accomplish, right? In general, yes,
I agree. But I see such frequent abuse of DATETIME columns in range
queries that I felt it deserved some treatment.
The long, long, long laundry list of offenses (apologies in advance)
The most frequent
faux pas I see is when someone uses regional date formats. For
example, they want all the rows from a particular day. First they try:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn = '10/11/2009';
|
The
first problem there is, what if the system has British regional
settings or the language is set to French? Is that October 11th or
November 10th? I wrote the query, and I don't even know! It would be
a shame to pull data from the wrong month, and not even notice. Much
better to use an unambiguous date format; in spite of what --CELKO--
will try to force you to believe, the only truly safe formats for
date/time literals in SQL Server, at least for DATETIME and SMALLDATETIME, are:
YYYYMMDD YYYY-MM-DDThh:nn[:ss[:mmm]]
|
As
an example, even if you try to use the seemingly unambiguous
YYYY-MM-DD, this can break under certain scenarios -- such as when the
user's language settings are set to French:
SET LANGUAGE FRENCH; GO SELECT CONVERT(DATETIME, '2009-10-13');
|
Result:
Le paramètre de langue est passé à Français. Msg 242, Level 16, State 3, Line 1 La conversion d'un type de données varchar en type de données datetime a créé une valeur hors limites. |
For
those of you not fluent in Français, that essentially says (in my best Quebec accent), "There is
no month 13, dummy!" This is because in French that date format is interpreted as YYYY-DD-MM. (For some background on the attempts we've made to deprecate this interpretation, see Connect #290971.)
As opposed to YYYY-MM-DD, YYYYMMDD will never break. If you decide to use
any other format for your date string literals, at least for DATETIME
and SMALLDATETIME types, you are leaving yourself open to errors or
incorrect data should a user have different session settings, or should
the application be moved to servers with different settings. In SQL Server 2008, the new types are a little more insulated from user or machine settings; still, I use YYYYMMDD for consistency and to be safe.
When
the user fixes that and passes in a proper string literal format, there
is a problem with this query in most situations, since DATETIME and
SMALLDATETIME columns have a time component. Unless you always strip
out the time when entering data (or use a computed column that does
this for you), this query should yield few, if any, rows:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn = '20091011';
|
This is because the data looks like this:
DateColumn ----------------------- 2009-10-11 00:14:32.577 2009-10-11 04:31:16.465 2009-10-11 08:45:57.714
|
What the query above is actually asking is:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn = '2009-10-11T00:00:00.000';
|
So there should be no surprise that no results are returned, since none of the values match that criteria.
What
does the user do next? The same thing I did the first time I came
across this problem. Convert the left side of the equation to a
string, stripping off the time component:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) = '20091011';
|
NOW
I can get my data, right? Well, yes, you can get your data all right.
But now you've effectively eliminated the possibility of SQL Server
taking advantage of an index. Since you've forced it to build a
nonsargable condition, this means it will have to convert every single
value in the table to compare it to the string you've presented on the
right hand side. Another approach users take is:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '20091011' AND '20091012';
|
Well,
this approach is okay, as long as you don't have any rows that fall on
midnight at the upper bound - which can be much more common if parts of
your application strip time from date/time values. In that case, this
query will include data from the next day; not exactly what was
intended. In some cases, that *is* what is intended: some people think the above query should return all the rows from October 11th, and also all the rows from October 12th. Remember that this query can be translated to one of the following, without changing the meaning:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '2009-10-11T00:00:00.000' AND '2009-10-12T00:00:00.000'; -- or SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= '2009-10-11T00:00:00.000' AND DateColumn <= '2009-10-12T00:00:00.000';
|
(Note that in the second example, that is greater than or equal to the first variable and less than or equal to the second variable.) This means that you will return rows from October 12th at exactly midnight, but not at 1:00 AM, or 4:00 PM, or 11:59 PM.
Then the user tries this, so they can still use BETWEEN and save a few key strokes:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '20091011' AND '2009-10-11T23:59:59.997'; -- or SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn BETWEEN '20091011' AND DATEADD(SECOND, -1, '20091012');
|
These are no good either. If the data type of the column is SMALLDATETIME,
the comparison is going to round up, and you *still* might include data
from the next day. For the second version, if the data type of the column is DATETIME, there is still the possibility that you are going to miss rows that have a time stamp between 11:59:59 PM and 11:59:59.997 PM. Probably not many, but if there is even one, your data is no longer accurate.
(Note that if you are using the DATE data type in SQL Server 2008,
or can guarantee that you always remove the time component from the
column, BETWEEN is okay. But for consistency, I still stay away from BETWEEN.)
Another thing I see a lot is when people want a range like a month or a year. Can you believe that people write code like this:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn LIKE '200910%';
|
The
problem with this is that, even while SQL Server will implicitly
convert DateColumn to a string for you, it does *not* convert it to
CHAR(8) with style 112, which would be required for this wildcard
search to work. (You can see what it will do "for you" when you try PRINT CURRENT_TIMESTAMP;.) So maybe they meant to do it this way:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) LIKE '200910%';
|
But
this is still a bad idea because, like above, this creates a
nonsargable condition, and prevents an index on DateColumn from being
utilized. And finally, how about this one:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DATEPART(YEAR, DateColumn) = 2009 AND DATEPART(MONTH, DateColumn) = 10;
|
This looks more like something you would see in OLAP, where you actually have measures and dimensions that will allow you to query the data this way - efficiently. In the OLTP world, this type of nonsargable query is not going to perform any better than any of the others above, and it makes parameter passing and validation more complex as well (imagine the leap year validation you'd require for a date passed in as year, month, day when the date is February 29th).
I'm not making *ANY* of these up; I have seen them all out there in the wild, either in code I've reviewed, systems I've inherited, or questions I've seen on the newsgroups or forums.
The best approach, IMHO
In order to make best possible use of indexes, and to avoid capturing too few or too many rows, the best possible way to achieve the above query is:
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= '20091011' AND DateColumn < '20091012';
|
Hopefully
the queries are not being written this way, and the data is actually
passed to the statement as a properly typed variable. When you can
help SQL Server avoid implicit conversions, you should do so. If you are intending to allow just one day at a time in your query, you could write a stored procedure like this:
CREATE PROCEDURE dbo.GetLogCountByDay @date SMALLDATETIME AS BEGIN SET NOCOUNT ON;
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @date AND DateColumn < DATEADD(DAY, 1, @date); END GO EXEC dbo.GetLogCountByDay @date = '20091011';
|
Why don't I use < (@date + 1) there? To enforce a best practice. I'll
admit, I've used the lazy DATEADD shorthand for years. However, I now
consider that a bad habit too, as it breaks with the new DATE data
types in SQL Server 2008:
DECLARE @d DATETIME2(7) = SYSDATETIME(); SELECT @d + 1;
|
Result:
Msg 206, Level 16, State 2, Line 3 Operand type clash: datetime2 is incompatible with int
|
If you want to support a range of dates, then the change is minor:
CREATE PROCEDURE dbo.GetLogCountByDateRange @StartDate SMALLDATETIME, @EndDate SMALLDATETIME AS BEGIN SET NOCOUNT ON;
SELECT COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @StartDate AND DateColumn < DATEADD(DAY, 1, @EndDate); END GO EXEC dbo.GetLogCountByDateRange @StartDate = '20091011', @EndDate = '20091015';
|
And if you wanted to return the counts for each day, you could do this (assuming you have a Numbers table that starts at 1):
CREATE PROCEDURE dbo.GetDailyLogCountByMonth @Month SMALLDATETIME AS BEGIN SET NOCOUNT ON;
WITH [days] AS ( SELECT [day] = DATEADD(DAY, [Number]-1, @Month) FROM dbo.Numbers WHERE [Number] <= DATEDIFF(DAY, @Month, DATEADD(MONTH, 1, @Month)) ) SELECT d.[day], COUNT(t.DateColumn) FROM [days] AS d INNER JOIN dbo.SomeLogTable AS t ON t.DateColumn >= d.[day] AND t.DateColumn < DATEADD(DAY, 1, d.[day]) GROUP BY d.[day] ORDER BY d.[day]; END GO EXEC dbo.GetDailyLogCountByMonth @Month = '20091001';
|
Performance comparison
Let's compare a couple of these approaches. First, we need to build a table and some procedures (this looks like a LOT of code, but it took about 6 seconds to create on my VM):
CREATE DATABASE DateTesting; GO USE DateTesting; GO
CREATE TABLE dbo.SomeLogTable ( DateColumn DATETIME ); GO CREATE CLUSTERED INDEX x ON dbo.SomeLogTable(DateColumn); GO SET NOCOUNT ON;
-- populate a numbers table with 500K rows:
DECLARE @UpperLimit INT; SET @UpperLimit = 500000;
WITH n AS ( SELECT x = ROW_NUMBER() OVER (ORDER BY s1.[object_id]) FROM sys.objects AS s1 CROSS JOIN sys.objects AS s2 CROSS JOIN sys.objects AS s3 CROSS JOIN sys.objects AS s4 ) SELECT [Number] = x INTO dbo.Numbers FROM n WHERE x BETWEEN 1 AND @UpperLimit; GO CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]); GO
-- get 500K pretty evenly distributed rows into the log table:
INSERT dbo.SomeLogTable(DateColumn) SELECT DATEADD(SECOND, -[Number], DATEADD(MINUTE, ([Number]), '20090901')) FROM dbo.Numbers; GO
-- create a procedure for getting a day's log count
-- good way: GO CREATE PROCEDURE dbo.Good_LogCountByDay @date SMALLDATETIME AS BEGIN SET NOCOUNT ON; DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @date AND DateColumn < DATEADD(DAY, 1, @date); END GO
-- bad way #1: GO CREATE PROCEDURE dbo.Bad_LogCountByDay_1 @date SMALLDATETIME AS BEGIN SET NOCOUNT ON; DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) = @date; END GO
-- bad way #2: GO CREATE PROCEDURE dbo.Bad_LogCountByDay_2 @year INT, @month INT, @day INT AS BEGIN SET NOCOUNT ON; DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DATEPART(YEAR, DateColumn) = @year AND DATEPART(MONTH, DateColumn) = @month AND DATEPART(DAY, DateColumn) = @day; END GO
-- create procedures for getting a month's log count
-- good way: GO CREATE PROCEDURE dbo.Good_LogCountByMonth @Month SMALLDATETIME AS BEGIN SET NOCOUNT ON;
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DateColumn >= @Month AND DateColumn < DATEADD(MONTH, 1, @Month); END GO
-- bad way #1: GO CREATE PROCEDURE dbo.Bad_LogCountByMonth_1 @Month SMALLDATETIME AS BEGIN SET NOCOUNT ON;
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE CONVERT(CHAR(8), DateColumn, 112) LIKE CONVERT(CHAR(6), @Month, 112) + '%'; END GO
-- bad way #2: GO CREATE PROCEDURE dbo.Bad_LogCountByMonth_2 @year INT, @month INT AS BEGIN SET NOCOUNT ON; DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.SomeLogTable WHERE DATEPART(YEAR, DateColumn) = @year AND DATEPART(MONTH, DateColumn) = @month; END GO
|
Just by looking at it, you probably have a good idea how this going
to end. All the same, we can test each set of stored procedures in two different ways:
(a) Getting the data for a single day
- First, let's just do a one-to-one-to-one comparison of the execution plan, just to see what we get:
EXEC dbo.Good_LogCountByDay @date = '20091005'; EXEC dbo.Bad_LogCountByDay_1 @date = '20091005'; EXEC dbo.Bad_LogCountByDay_2 @year = 2009, @month = 10, @day = 5;
|
As expected, the "Good" version of the procedure has a far more favorable plan, using a clustered index seek as opposed to a clustered index scan. Here is how the plans compare (click to enlarge):

- In case the differences in the plan do not highlight the performance implications, let's run each procedure 1000 times, to see how long it takes. Remember to turn off the "Include Actual Execution Plan" option!
SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Good_LogCountByDay @date = '20091005'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByDay_1 @date = '20091005'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByDay_2 @year = 2009, @month = 10, @day = 5; GO 1000 SELECT CURRENT_TIMESTAMP; GO |
Results:
2009-10-16 12:05:06.123 2009-10-16 12:05:07.063 (~1 second) 2009-10-16 12:11:09.650 (~6 minutes, 2 seconds) 2009-10-16 12:12:46.197 (~1 minute, 46 seconds)
|
So, clearly the date range query is far superior to the other two. And while the execution plans for the two "bad" versions of the procedure showed that their costs should be roughly equivalent, in reality it turns out that the procedure that handles the CONVERT() on the left-hand side is far more costly, at least in terms of duration, than the version that uses DATEPART() to extract the year, month and day.
(b) Getting the data for a month
- Let's turn "Include Actual Execution Plan" back on, and compare the plans for the next set of procedures:
EXEC dbo.Good_LogCountByMonth @month = '20091001'; EXEC dbo.Bad_LogCountByMonth_1 @month = '20091001'; EXEC dbo.Bad_LogCountByMonth_2 @year = 2009, @month = 10;
|
We see a very similar result to the above, where the "good" procedure uses a clustered index seek, and the "bad" procedures use a scan (click to enlarge):

- Now, let's try these procedures 1000 times each, and measure how long they take (again, you don't want to run these loops with execution plan enabled):
SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Good_LogCountByMonth @month = '20091001'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByMonth_1 @month = '20091001'; GO 1000 SELECT CURRENT_TIMESTAMP; GO EXEC dbo.Bad_LogCountByMonth_2 @year = 2009, @month = 10; GO 1000 SELECT CURRENT_TIMESTAMP; GO
|
Results:
2009-10-16 12:16:59.727 2009-10-16 12:17:04.383 (~5 seconds) 2009-10-16 12:21:40.640 (~4 minutes, 36 seconds) 2009-10-16 12:23:13.950 (~1 minute, 33 seconds)
|
Again we see that the date range query performs quite well compared to the other two, and that the CONVERT() version takes far longer to complete than the DATEPART() version. I guess if you are going to continue to use a "bad" approach, you can at least easily determine which is the lesser of two evils. :-)
It is not surprising that the performance aspect of the "good" approach shows significant improvement over the nonsargable versions. I could probably also demonstrate cases where you accidentally retrieve too few rows, or too many rows -- but this article seems to be getting a little long already, so I'll leave the data correctness discussion for another day.
Don't forget to clean up:
USE [master]; GO DROP DATABASE DateTesting; GO
|
A few other tidbits
As an aside, if you only want whole dates, make
sure your input validation is functional and that users know what
format to enter. Nothing can go right if you let users enter freeform
dates and some of them enter d/m/y and others enter m/d/y. Safest to
use a calendar control / date picker, then you can dictate exactly what
the format is. And to be safe, sanitize the input by converting it to
midnight, e.g.:
SET @DateInput = DATEDIFF(DAY, 0, @DateInput);
|
If you want more control over beginning and end ranges (let's say for a month-based procedure, where you want to report on whole months), you can do this:
SELECT @StartDate = DATEADD(DAY, 1-DAY(@d), DATEDIFF(DAY, 0, @StartDate)), @EndDate = DATEADD(MONTH, 1, @StartDate);
|
Finally, I have seen stored procedures
where DATETIME values are passed in as CHAR(8) or CHAR(10). Don't
pass a date into a stored procedure using a string-based parameter: always use
properly typed parameters. If your client-side validation is broken or
being bypassed, this can cause problems you can stomp further up the chain by using the correct data type in the first place.
Summary
The main take-away points I was trying to get across in this post are:
- avoid ambiguous formats for date-only literals;
- avoid BETWEEN for range queries against DATETIME, SMALLDATETIME, DATETIME2, and DATETIMEOFFSET;
- avoid calculations on the left-hand side of the WHERE clause; and,
- avoid treating dates like strings.
For a lot more helpful information on date and time, see Tibor Karazsi's article, "
The ultimate guide to the datetime datatypes."
Over the past couple of weeks, I have developed a series of "Bad habits to kick" articles, in an
effort to motivate people to drop some of the things that I hate to see
when I inherit code. I have several more ideas in development, and I'll gladly take suggestions for future articles, but this should be the last post in the series for at least a few days. I hope the series has been interesting.