THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Aaron Bertrand

Aaron is a senior consultant for SQL Sentry, Inc., makers of performance monitoring and event management software for SQL Server, Analysis Services, and Windows. He has been blogging here at sqlblog.com since 2006, focusing on manageability, performance, and new features; has been a Microsoft MVP since 1997; tweets as @AaronBertrand; and speaks frequently at user group meetings and SQL Saturday events.

Bad habits to kick : choosing the wrong data type

In my last post in this series, I talked about the assumptions many people make about IDENTITY columns.  In this post, I want to focus on choosing data types.  There are several areas where I see frequent mistakes in data type choice, and I'll mention a few here.

Using CHAR / VARCHAR / INT for date / time values

Countless times I have seen character-based data types for columns meant to store date/time information, usually to store their preferred format (e.g. d/m/y) or to strip the date or time from the value.  Even worse is when they use NCHAR or NVARCHAR when we know that dates are not going to include any Unicode characters.  There are several negative consequences to this.  Most importantly, you lose proper validation - with a string, anybody can put "12/42/9999" or "36/8/211" or "foo" into that column.  Depending on what format you store and how you run conversions, you will lose the ability to perform range queries, and you will have definite losses in the area of index operations.  You also lose the ability to reliably perform date math operations like DATEADD(), DATEDIFF() and DATEPART(). 

You lose less in terms of index performance and storage when you store the date (e.g. in YYYYMMDD format) as an INT, but again you will have to be very creative in how you query the table and what types of parameters you pass in to support range queries.  A useful example to point out is how painful it is for us to get useful date/time information out of the sysjobhistory table in msdb.

It is much better to store these values as DATETIME or SMALLDATETIME and either strip off the parts you don't want before insert, or ignore them when you query.  You can also use computed columns (or populate them manually) at insert time, and keep both the original value and the "rounded" value.  In SQL Server 2008, you can also use the new DATE and TIME data types if you are only interested in one component or the other.


Using DATETIME instead of SMALLDATETIME

If you are only interested in the date portion of the value, or don't need the granularity of time below the minute, use SMALLDATETIME instead of DATETIME.  You will save 4 bytes per row and your indexes and range queries will be much more efficient.  In SQL Server 2008, you can use DATE in the "date only" case, and save an additional byte.  Your users and your SAN administrator will thank you.  When I have some time, I will post an article demonstrating some of the gains of using the smaller date/time data types where possible.


Using TIME to store duration

With SQL Server 2008, it may be tempting to store a duration in a column defined with the new TIME data type.  This data type is not meant to store duration but actually a point in time.  A problem with trying to store duration here is, what happens when your duration exceeds 24 hours?  It would be much more useful to store the "duration" in two columns - StartTime and EndTime.  You could add a third column which can be computed, calculating the duration in minutes or seconds or whatever makes sense.  Your choice for the StartTime and EndTime columns could be TIME if you are reasonably confident that durations will always be less than 24 hours and will not cross any midnight boundaries, but more likely they should be SMALLDATETIME or DATETIME. 


Using NVARCHAR to store data like zip or phone

Once in a while I see zip and phone columns defined as NVARCHAR.  I am not sure what phone number can contain Unicode characters, so maybe it is a valid choice for you, but somehow I doubt it.  I also often see these defined far wider than they need to be.  NVARCHAR(50) for phone?  Really, people?  This is how you can tell someone created their table either through the SSMS UI (which offers 50 as the default) or from upsizing Access (which converts all text columns to NVARCHAR).

E-mail is an interesting one.  We use VARCHAR(320), since the local part (username) is limited to 64 characters, and the domain name is limited to 255 characters.  64 + @ symbol + 255 = 320.  In our systems, we do not need to worry about supporting far East e-mail addresses, which can require Unicode characters.  We also have yet to be bitten by someone who has a very long domain name *and* insists on using a sub-domain to exceed 255 characters.  They would also have to use a very large local part which, I guess, is possible, but very unlikely.  When that user comes in and breaks one of our systems, I'll point them to this post.  :-)

Using VARCHAR to store data like name and address

Your applications should be prepared for internationalization, even if right now you are not supporting it at all.  You might think that because you only have American customers or users that you don't need to worry about foreign characters, however there are potential embarrassments lurking if you store a user's name with VARCHAR and lose part of the data.  In most cases you will eventually need to support Unicode, so you are likely just delaying the inevitable by using VARCHAR now.  In SQL Server 2008 R2, you can use data compression to significantly reduce the storage and I/O implications of doing so, since it will essentially treat all non-Unicode characters as if you had used VARCHAR.

Using VARCHAR instead of CHAR

If you know your column is going to store a (relatively) fixed number of bytes, you should consider using CHAR instead of VARCHAR.  You may sometimes lose a little bit due to the padded spaces at the end, but you will often gain this back by not having to worry about row overhead for storing the number of characters.  The basic rule of thumb I have seen out in the wild is that if you are storing fixed size, or variable size less than 10 or 12 characters, use CHAR.  Of course, you need to be sure you understand how the column will be used in the future.  If there is any chance that you will later store more data here, you are probably better off using VARCHAR.  If memory serves me right, you will have much less pain later if you have to increase the size of the column and rebuild the table, if you are going from VARCHAR(x) -> VARCHAR(x+y), since this doesn't impact existing rows the way it would if you were converting CHAR(x) -> CHAR(x+y) or VARCHAR(x+y).

An example of where using CHAR ended up being a bad choice is this KB article referenced by KBAlertz.  This is reading the text from KB #200040, but it should actually be reading the text from KB #2000403.  How embarrassing!  I'm sure they are working on a fix, but as a commenter pointed out, they probably passed 2000403 into a parameter, and it silently truncated to 200040 (see how silent truncation can be a problem?).  Depending on how many KB articles they've loaded in the new 7-digit format, this could take them a while to straighten out, and it will almost certainly be a manual, one-by-one affair.

Using BIT for properties like Status

It's always dangerous to use two-three-valued logic to store a status value.  Initially you might always say that something is "on" or "off" - but later, you might need additional states such as "loading" or "archived" or "hidden."  Why not use TINYINT to leave room for expansion?  You don't lose anything since TINYINT still takes up only 1 byte.  You can add a check constraint which will make it easy to define the domain of values without having to change the data type later.


Using INT instead of BIGINT

If you have a large table with an IDENTITY column that holds transaction data, and will be doing so for very high volumes or for a very long time, you might run into a case where you run out of IDENTITY values because you hit the ~2 billion upper bound on the INT data type.  As I explained earlier in this series, a common reaction is to "design this better" by starting the IDENTITY seed at the lower bound of INT instead of 1.  This doesn't really buy you anything except that it temporarily delays the need to deal with it ... you'll still have to fix it eventually, because if you're hitting 2 billion rows now, how long will it be before you hit 4 billion?  It is relatively impossible, in our lifetimes, to use up all the positive values that a BIGINT supports, unless you set the increment to some ungodly number.  So in cases this like this it can make sense to take the hit now and use BIGINT.  If you're using SQL Server 2008, this is another area where data compression can help reduce the impact, since it can compress many of the smaller values.


Using INT instead of SMALLINT or TINYINT

Let's say you have a lookup table for sex_code.  Instead of storing M/F/U etc. (or the full words Male/Female/Unknown) in all of the tables, you just want to store a numeric representation (there are industry standards for this).  However, there is no reason to use INT here; since there is no impending evolutionary change we can envision that will require the number of options even jumping to 10, never mind > 255.  So why not use TINYINT?  It won't make a big difference to the lookup table itself, since this is going to be tiny and will likely be in memory anyway.  But on the real tables, you're going to save 3 bytes per row in storage, and your reads, joins, seeks and scans are going to be slightly more efficient. 


Using FLOAT / REAL instead of DECIMAL / NUMERIC

Since FLOAT and REAL are approximate numbers, you can run into all kinds of presentation / storage issues if you use them inadvertently.  They have their place, and if you are using scientific data that requires the properties of floating point numbers, th they are the right choice.  But in most cases, I think this is another evolutionary thing that happens accidentally when you upsize from Access.  You are much better off using DECIMAL or NUMERIC (which are functionally equivalent) unless you really know that you should be using FLOAT.  For some more background information, you can look at this thread from StackOverflow.


Using MONEY / SMALLMONEY instead of DECIMAL / NUMERIC

I see no gain in using the MONEY or SMALLMONEY types.  One benefit that a few people have related to me is that it describes *exactly* what is in the column.  This isn't just a number, it's CASH!  Like I would look at a column called "Salary" or "RetailPrice" and assume it was storing peanuts or cheese doodles.  The problem with these types is that there are definite issues with certain arithmetic operations.  You can review this thread on StackOverflow for more information.


Using SQL_VARIANT at all

A lot of people think they are clever and will use SQL_VARIANT to store various types of data in a single column.  I can tell you from experience that this will be a major headache for you before long.  Several external tools don't support it (e.g. entity framework), some providers like ODBC automatically convert the values to NVARCHAR(4000), and you can't use the column in the definition of a computed column.  It also can't be part of a primary or foreign key, nor can you use wildcard operators such as LIKE.  And even if you only store numbers, you cannot use operators like SUM(), MAX(), addition or subtraction.  For a lot of these things you need to do an explicit CONVERT(), but to avoid conversion issues, you need to be careful about your WHERE clause also.  Jonathan Kehayias talks about a few SQL_VARIANT issues in his blog post earlier this year.  Recently Microsoft PSS also posted an article about potential memory leaks involving linked servers and SQL_VARIANT.


Summary

Choosing the right data type is all about knowing the right questions to ask about how the data is being used and how it may be used in the future.  There is no one-size-fits-all answer, and we can't predict every single future change to our applications, but you should be aware of the potential pitfalls you may encounter every time you make a data type decision.

I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code.  Up next: abusing triggers. 

Published Monday, October 12, 2009 5:51 PM by AaronBertrand

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Brian Tkatch said:

"Choosing the right data type is all about knowing the right questions to ask about how the data is being used and how it may be used in the future."

Bingo!

Aaron, keep it up.

October 12, 2009 5:40 PM
 

Helping people kick bad SQL Server habits « OTO One to One Interactive said:

October 12, 2009 7:23 PM
 

mjswart said:

I think your "Using NVARCHAR instead of VARCHAR" is a "Do" in a list of "Do nots".

From the text of the item, you probably meant vice versa.

You also mention using "VARCHAR instead of CHAR" and talk about space savings. I've never seen any significant space savings gained by this rule of thumb and with the danger of underestimating the size needed (a danger you acknowledged) I'd say it's best to just choose VARCHAR and move on. See http://dbwhisperer.blogspot.com/2009/09/char-what-is-it-good-for.html

October 13, 2009 8:49 AM
 

AaronBertrand said:

Thanks Michael, you're right about that item; that heading should have said VARCHAR.  I've corrected it.

As for CHAR vs. VARCHAR, I was talking more about the row overhead.  You're probably right that CHAR is rarely appropriate, even if the data is allegedly "fixed"; ISBNs expanded to 13 characters, and I'm sure SSNs and VINs will do so as well.  

I was going with conventional wisdom on this one; and again, only suggested "should consider" with perhaps some unstated emphasis on "test for yourself."  As for my own personal use, I very rarely use CHAR.  I am looking through my codebase for the primary system I've developed and managed over the past several years, and I have a grand total of 17 CHAR or NCHAR columns (out of 334 character-based columns).  Some of these columns are things like gender and state, non-nullable, and then some are columns in views (such as datetime values converted to (CHAR(10), 120) for display purposes).

October 13, 2009 9:12 AM
 

Bob Probst said:

Nice series of articles you've got.  What's your opinion on using the "max" scale with varchars.  I've frequently fallen back on it because in the past I've had to deal with input scale changes (names an now be 56 characters not 55! Yeah!).  But still, it makes me feel like I'm being lazy.  I tell myself that varchar(max) uses just as much storage as varchar(50) for the same value but I also realize that I am misrepresenting the business logic by not constraining the input:  if name is supposed to be <= 56 bytes and I get 60 from the source system, I need to have a conversation with someone and not just hide it under the "max" rug.

Thoughts?

October 13, 2009 10:51 AM
 

AaronBertrand said:

I would never use MAX for something like this.  I use MAX when I know I'm going to need to support more than 8000 bytes (4000 characters for NVARCHAR, 8000 characters for VARCHAR).  You need to have a data dictionary and enforce it.  If the app can pass 60 characters when the database expects 56, then there is a disconnect somewhere that you need to correct, sorry.  Possible suggestion: ask what the most they will *ever* need to support.  Double it for the database column itself (e.g. if they say VARCHAR(64), then use VARCHAR(128)).  Make the parameters to your stored procedures VARCHAR(64).  So, they will lose their data if they don't listen to you and pass more than 64 characters (but ideally their data validation before they submit data matches the data types of the parameters they use).  When you really DO need to extend to 72 characters, or 96 characters, or 100 characters, or 128 characters, you will initially only have to change the input parameters as opposed to changing the underlying schema.

October 13, 2009 11:04 AM
 

jchang said:

I will add that before setting absolute hard rules for space efficiency, consider the impact.

The cluster key should be almost perfect on space efficiency, as this will be part of every non-clustered index as well.

The next priority is any column that will be in a non-clustered index key. If the key is fat, then the index will be deep, impairing index seek efficiency (even though there does not appear to be much difference between index depth 3 vs 4).

Slightly lower in priority is any column in included columns of a nonclustered index. Its not part of the key, but it still has to be duplicated between the table and nonclustered index.

The lowest priority are columns not part of any index. If you feel compelled to splurge on space usuage, do it here. This might let you add a bullet on your resume that you managed 10TB database, even though we know its because you were lax in your choice of data types

October 13, 2009 11:48 AM
 

AllenMWhite said:

Another point with the MAX designator is that it turns a normal column into a LOB column, which prevents features like online reindexing from being performed.

Great series, Aaron.

October 13, 2009 11:55 AM
 

sql_noob said:

i have been part of a move from int to bigint. the best part is if you have a server with years of archive data, older hardware and no chance to buy new hardware then you spend a few months moving the data manually.

October 13, 2009 12:20 PM
 

Zack Jones said:

Great series! Here's a couple of other issues I've seen when dealing with databases upgraded from 2000 to 2005:

1 - Using Text instead of varchar(MAX).

2 - Using Image instead of varbinary(MAX).

I also hate looking at the table definition and seeing all varchar(50) or nvarchar(255) columns.

Interesting concept about making the fields larger than the user wanted to allow for future growth.

October 13, 2009 12:45 PM
 

John McCurdie said:

Much food for thought.  My comment is about the use of English in the INT and BIGINT section:

"It is relatively impossible, in our lifetimes, to use up all the positive values that a BIGINT supports, unless you set the increment to some ungodly number."

Impossible is like a BIT function: on or off.  Use "very difficult" instead of "relatively impossible".

October 18, 2009 3:58 AM
 

Bruce W Cassidy said:

Nice article.  I like how you are keeping these very much to the topic at hand.

I'm surprised you didn't include the list of values encoded in a single VARCHAR.

One thought regarding the use of NVARCHAR for phone numbers: it's a common practice now to have words (like business names) encoded as a phone number.  You know, like the "1800 CALLMENOW" kind of thing.  Do you store those as text or as numbers?  If you store them as text, what happens in non-english speaking countries?

October 18, 2009 3:49 PM
 

Thomas said:

I disagree about using smalldatetime. IMO, it should almost never be used. Sure, it saves on space, but its maximum value is the bizarrely chosen 2079-06-06. Aside from not being intuitive, I have run into systems that initially would "never" hit that date but in reality did. Storage is cheap. If storing only the date is important, then upgrade to SQL 2008 and use the DATE data type.

RE: Using Char vs Varchar

There is more at stake than just adding a few extra characters. Equating values is different in middle-tier code. When you insert "FOO" into a Char(10), it will come back as 10 characters. While SQL statements will ignore the spaces, .NET code for example will not which means the developers need to trim every value coming back from that field if they are going to do equality tests against it.

RE: Int instead of BigInt

Again, it depends on the design. I find it interesting that you advocate smalldatetime for space reasons but not int for the same reason. I have systems I built over 10 years ago that are still in use using ints and are no where close to needing a bigint. If the system is designed well, it shouldn't be an issue to change from int to bigint later. In fact, a better argument could be made that the middle tier should assume 64-bit ints rather than the database.

RE: Using Smallint

There is a reasonable argument against smallint and tinyint in that they are not standard SQL and that using check constraints and ints (or bigints) makes your intent clearer than relying on the arbitrary restrictions of smallint and tinyint.

October 18, 2009 4:44 PM
 

AaronBertrand said:

Thomas, have you ever tried to change a column from INT to BIGINT on a table with 2 billion rows?  Your point about the upper bound on SMALLDATETIME is a good one.  Where I use SMALLDATETIME, it is for *current* data, that doesn't need sub-minute or sub-day accuracy and isn't going to be holding dates 60+ years in the future.  I'll worry about the upper bound if I'm still working with SQL Server in 2079.  :-)

Bruce, I was planning on treating the comma-separated list as a separate topic on its own.  In most cases these are used because SQL Server doesn't know what arrays are, and because it is quite convenient for applications to pass such a list in a single parameter (and in a lot of cases, such as a set of checkboxes or multi-select list items in a web form, the values are already organized in this very format).

John, I am not writing an English dissertation.  I don't think making the correction you suggest will change the meaning of the sentence for any of my other readers.

October 18, 2009 4:59 PM
 

Thomas said:

RE: int to bigint

This comes back to reasonable expectations. If (a big "if") the middle-tier is designed to accomodate bigints, then changing from int to bigint, while time consuming, isn't the end of the world. If you reasonably expect 10s of millions of rows, then yes, using a bigint is probably a good choice. If you expect at most hundreds of thousands of rows, then the use of a bigint can be reasonably questioned.

RE: Smalldatetime

The issue isn't whether you will be working with SQL in 2079. It's whether the data will need to capture something beyond D-Day of 2079. One system I build captured contract data. Who'd sign a contract for 80 years in the future? Welp, they did (or tried to). However, as you said, if you only use it to capture "today" dates, then it's fine.

October 18, 2009 5:26 PM
 

AaronBertrand said:

Thomas, I tried to prevent the argument you're presenting about BIGINT by qualifying in my description: "column that holds transaction data, and will be doing so for very high volumes or for a very long time"... hopefully you didn't interpret that as hundreds of thousands of rows.  Otherwise I'm not sure why there would be any danger in hitting the upper bound of INT.

October 18, 2009 5:31 PM
 

Thomas said:

Heh. As the saying goes, the only definitive answer is "It depends". ;->

October 18, 2009 6:43 PM
 

Bad (MS SQL Server) habits to kick « devioblog said:

October 19, 2009 1:36 AM
 

Tom K said:

Thanks for the great advice.  I still have to train myself to use SMALLDATETIME and TINYINT types when they are more appropriate.  I would also like to add one more annoyance I come across very often.

Inappropriate Use of TEXT and NTEXT:

In most cases, a large VARCHAR or NVARCHAR field will we sufficient for storing all of the characters required for these fields.  Beginning with SQL Server 2005, you can use the VARCHAR(MAX) or NVARCHAR(MAX) data types to store a virtually unlimited number of characters.  The TEXT and NTEXT types introduce many complications such as the inability to group on these fields or perform many of the common string operations.

October 19, 2009 2:45 PM
 

SeanJA said:

Choosing the wrong data type:

Storing date/time values as a unix timestamp rather than any of the date/time related types. You lose all of the date functions and have to mentally convert everything before writing. I will take DATEDIFF over (< 86400 * something) any day.

October 19, 2009 8:29 PM
 

Everest said:

I might add...go larger when in doubt. Disk space is cheap; having to change a bunch of code and downstream processes ad infinitum because someone read a book in literal fashion and thought they were doing a service by saving two bytes by using a smallint - wrong.  Two years later and the client now has a problem and a new project on their hands because of the decision.

Good read, thanks.

Lee

October 19, 2009 10:14 PM
 

csm said:

Aaron, when you're talking about BIT or TINYINT, I think you have a (little) mistake. Of course, both data types takes up only 1 byte, but when you have 8 or less BIT columns, you still needs 1 byte to store the 8 columns.

As BOL says:

"The Microsoft SQL Server 2005 Database Engine optimizes storage of bit columns. If there are 8 or less bit columns in a table, the columns are stored as 1 byte. If there are from 9 up to 16 bit columns, the columns are stored as 2 bytes, and so on."

November 12, 2009 7:30 AM
 

Aaron Bertrand said:

csm, you're right of course.  I was thinking of the simplest case, where there is only one such column in the table.

November 12, 2009 7:41 AM
 

Aaron Bertrand said:

In my last post in this series , I talked about the common habit of creating an IDENTITY column on every

February 8, 2010 10:02 PM
 

An update of my SQL Server “Bad Habits” series « OTO One to One Interactive said:

March 8, 2010 11:31 PM
 

Cup of SQL Coffee said:

Select of T-SQL/Database Bad Habits to Kick

March 21, 2010 2:44 PM
 

Select of T-SQL/Database Bad Habits to Kick « Placko's SQL KB said:

January 3, 2011 11:04 AM
 

An update of my SQL Server "Bad Habits" series | One to One said:

May 5, 2011 3:31 PM
 

Helping people kick bad SQL Server habits | One to One said:

May 5, 2011 3:32 PM
 

6 Little Known Things That Can Cause Big Performance Issues | SQLRockstar | Thomas LaRock said:

April 25, 2012 8:38 AM
 

JamieC said:

"It's always dangerous to use two-three-valued logic to store a status value" -- not sure what you are trying to say here. Did you start with "two-valued" then intended to correct to "three-valued" but mistyped? Two distinct tristate data elements? 23-valued logic?!

May 23, 2012 5:14 AM
 

Aaron Bertrand said:

JamieC it should have read "two- or three-valued logic" meaning BIT NOT NULL (true, false) vs. BIT NULL (true, false, unknown).

May 23, 2012 7:57 AM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About AaronBertrand

...about me...

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement