THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Beware the ß

I stumbled upon an interesting little nuance of SQL Server’s behaviour over the past few days that I was not aware of and I figure its worth talking about it here so others are aware. It concerns the handling of the german character “ß” which I understand to be german shorthand for “ss” (I’m no expert on language or linguists so don’t beat me up if I’m wrong about that).

In short, two NVARCHAR values that differ only by one using “ß” and the other using “ss” will be treated as the same. This is easily demonstrated using the following code snippet:

SELECT 'Oktoberstrasse'
UNION
SELECT
'Oktoberstraße';
SELECT N'Oktoberstrasse'
UNION
SELECT
N'Oktoberstraße';

which returns:

image

(Collation on my database is set to ‘SQL_Latin1_General_CP1_CI_AS’)

Notice that casting the values as NVARCHAR (which is what the N'' notation does) causes SQL Server to treat them as the same. Obviously this is going to cause a problem if you need to treat those as distinct values (such as inserting into a column with a unique key upon it – which is the problem I encountered that caused me to stumble across this)

There is a bug submission to Connect regarding this issue at 'ß' and 'ss' are NOT equal in which a member of the SQL Server team says:

Our current behavior follows the SQL/ISO standard and unless those standards are updated with the latest changes we don't intend to change the behavior in SQL Server. Changing existing SQL Server behavior has lot of implications and today we rely on Windows for all of our windows collation sorting capabilities. If and when in the future Windows adopts these new rules / Unicode standard we will incorporate it in SQL Server.

In other words, SQL Server is merely following the behaviour as defined by the International Standards Organisation so its not a bug in SQL Server as such, just a nuance that one needs to be aware of. And now you are.

@Jamiet

Published Friday, December 13, 2013 12:33 PM by jamiet
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Martin Preiß said:

I concur: the letter ß is a terrible character in information processing.

Regards

Martin Preiß

December 13, 2013 10:17 AM
 

Ian Yates said:

FYI, I have my SQL installs set to LATIN1_GENERAL_CI_AS (the default that appears for Australian English Windows installations).  On such a collation, at least when run from master, both statements only return a single value...

Forcing the collation like so

SELECT 'Oktoberstrasse'  COLLATE SQL_Latin1_General_CP1_CI_AS

UNION

SELECT 'Oktoberstraße' COLLATE SQL_Latin1_General_CP1_CI_AS;

indeed produces your results.

I imagine you can reproduce my results my doing a similar forced collation to LATIN1_GENERAL_CI_AS.

December 16, 2013 8:28 AM
 

Darren Gosbell said:

From my dim memories of high school German I believe that ß is not actually shorthand for 'ss' - it's an additional letter in the German alphabet and 'ss' is just an accepted substitute when using non-german keyboards. So you could not refer to SSDT-BI as ßDT-BI :)

December 16, 2013 3:04 PM
 

Janos said:

ß has been eliminated in the German language in 2006 by the German spelling reform as far as I remember.

SQL is correct in terms of comparison (Oktoberstraße = Oktoberstrasse) and dependig what you want to achieve with union, I would use UNION ALL. There are much more difficult languages with special accents and different letters look like the same (like czech, slovak, hungarian, etc) They makes more headache for me sometimes in SQL Server :)

December 28, 2013 5:14 PM
 

Boris said:

Janos, 'ß' has not been eliminated with the spelling reform, it was just even more seperated from the 'ss' (I am german...).

Darren is right as you "can" replace 'ß' with 'ss' (e.g. in Uppercase there is no 'ß' and when writing in other languages that don't have 'ß' as you write 'ae' for 'ä'), but not the other way round.

Reading this blog post a few weeks ago I thought 'Oh well, nice to know'. Unfortunately at the moment I am just having the exact problem of 'ß' and 'ss' being grouped by the SQL Server (which I use to build my dimension table) but not treated the same by the SSIS lookup so I get no matching row...

January 17, 2014 4:24 AM
 

jamiet said:

Hi Boris,

SSIS doesn't behave the same way? Urgh, that's horrible, though doesn't surprise me at all. What data types are you using within the pipeline and the lookup component, DT_STR or DT_WSTR?

JT

January 17, 2014 7:00 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement