This post is about the political issues involved with using multiple languages in a global organization and how to troubleshoot the technical details. The CHAR and VARCHAR data types are NOT suitable for global data. Some people still cling to CHAR and VARCHAR justifying their use by truthfully saying that they only take up half the space of NCHAR and NVARCHAR data types. But you’ll never be able to store Chinese, Korean, Greek, Japanese, Arabic, or many other languages unless you either use NCHAR and NVARCHAR which are double byte Unicode data types or use specific collations with CHAR and VARCHAR. Using collations is something I will address in a future post. Using CHAR and VARCHAR with collations does not save space and is trickier, which is why this post is focusing on using Unicode.
In troubleshooting a problem displaying Chinese characters, I need a test character to experiment with. I picked 坡 (decimal 22369, hexadecimal 5761) as my test character. Why? Because it looked Chinese to me! Now doesn’t that remind you of Act 1, Scene 2 of William Shakespeare’s Julius Caesar where Casca says to Cassius “but, for mine own part, it was Greek to me”? It really looked Chinese to me, but then, what do I know? That presented a problem. What if it meant something really offensive? After consulting a Chinese immigrant, I was relieved to know that my randomly chosen character was politically correct. And just to be extra safe, I double checked with fellow SQL Server MVP Greg Low. Thanks, Greg.
I wrote the following test code to show you what happens when Unicode (which is always double byte) is implicitly converted to single byte CHAR or VARCHAR.
create table #c (sqlString NVARCHAR(30), sqlChar NCHAR(1));
insert into #c values (N'insert into #c values ( ''坡'')', '坡'); -- invalid single byte syntax
insert into #c values (N'insert into #c values (N''坡'')', N'坡'); -- valid double byte Unicode syntax
, UNICODE(sqlChar) as N'UNICODE(sqlChar)'
drop table #c;
sqlString sqlChar UNICODE(sqlChar)
------------------------------ ------- ----------------
insert into #c values ( '坡') ? 63
insert into #c values (N'坡') 坡 22369
Here is another test you can run:
select'坡' , N'坡'
The point is that if you start with a Unicode character such as 坡 but you insert it as '坡', it is no longer a double byte character. You’ve crammed it into a single byte and corrupted it. You have to insert it as N'坡' to prevent corruption and you must have a double byte destination for it. By the way, the N stands for National Language. You may also see references to National Language Setting(s) or even National Character Set (NCS).
The Character Map utility in Windows is helpful when working with Unicode characters.
Figure 1. Character Map utility in Windows.
Personally I prefer the richer functionality of BabelMap, which is available as a free download, online application, or portable application.
Figure 2. BabelMap free Unicode utility.