SQL_Latin1_General_CI_AS – Why Not Use for a Global System?

Tags: collation, sql-server, sql-server-2008-r2

What would be some really good reasons to not use the SQL_Latin1_General_CI_AS collation on a system which handles English, German, Japanese, and Chinese data?

I'm having a really hard time finding good sources that compare collations and answer my question above, as well as the following:

  1. How is Japanese_CI_AS different from SQL_Latin1_General_CI_AS when it comes to sorting non-Japanese characters?
  2. Does UCA sort differently than Japanese_CI_AS and SQL_Latin1_General_CI_AS?
  3. What is standard industry practice for systems used globally? (For example, what does Facebook, Twitter, Google, Flickr, Baidu or Microsoft, IBM, and SAP use?)
  4. Does SQL_Latin1_General_CI_AS define a sort order for Japanese characters? If not, how/why does Japanese text work in SQL_Latin1_General_CI_AS?

Basically I'm trying to learn how to pick the appropriate collation 🙂

Thanks in advance!

Best Answer

Collations in SQL Server handle several aspects of string data:

  1. Locale / LCID (referring to the Culture: en-US, fr-FR, etc.)

    This is used to determine culture-specific overrides to the default linguistic sorting and comparison rules used by Unicode / NVARCHAR data across all Collations as well as non-Unicode / VARCHAR data for Windows (i.e. non-SQL_) Collations.

  2. Code Page

    This is the character set used for non-Unicode / VARCHAR data across all Collations. To be clear, Code Pages do not apply to Unicode / NVARCHAR data, as Unicode is a single character set. And to be super clear, Unicode is a single character set regardless of how it is encoded: UTF-8, UTF-16, or UTF-32.

  3. Sensitivity

    Case and Accent sensitivity can be controlled across all Collations. Kana and Width sensitivity can only be controlled when using the Windows Collations and are assumed to be "insensitive" when using the SQL_ Collations.

    Also, all of the Windows Collations should have a binary option (at least the deprecated _BIN, if not also _BIN2) whereas there are only two SQL_ Collations that have the _BIN / _BIN2 options: SQL_Latin1_General_CP850 and SQL_Latin1_General_CP437.

  4. Ability to handle Supplementary Characters

    Collations with names ending in _SC were added in SQL Server 2012. These allow for proper sorting / comparison, as well as proper handling by built-in functions, of UTF-16 Surrogate Pairs (which is how UTF-16 encodes Code Points above U+FFFF). This option is not available to any of the SQL_ Collations (a short sketch after this list shows the difference, along with how to read a Collation's LCID and Code Page).

    Please note that regardless of Collation, all UTF-16 data can be stored and retrieved without any data-loss in NVARCHAR / NCHAR / XML columns and variables, even if the Collation does not allow for properly interpreting Supplementary Characters.
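
To make a few of these points concrete, here is a minimal sketch (assuming SQL Server 2012 or later so that the _SC Collations exist; the specific Collation names are just examples) that reads a Collation's LCID and Code Page via COLLATIONPROPERTY and checks how LEN() counts a Surrogate Pair with and without an _SC Collation:

```sql
-- Read the LCID and Code Page of a Windows Collation and of a SQL_ Collation.
SELECT COLLATIONPROPERTY('Japanese_CI_AS', 'LCID')                   AS [Japanese_CI_AS LCID],
       COLLATIONPROPERTY('Japanese_CI_AS', 'CodePage')               AS [Japanese_CI_AS Code Page],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'LCID')     AS [SQL_Latin1 LCID],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS [SQL_Latin1 Code Page];

-- A Supplementary Character (U+1D11E, musical G clef) is stored in NVARCHAR as a
-- Surrogate Pair; built-in functions only see it as one character under an _SC Collation.
DECLARE @clef NVARCHAR(10) = NCHAR(55348) + NCHAR(56606);  -- 0xD834 + 0xDD1E

SELECT LEN(@clef COLLATE Latin1_General_100_CI_AS)    AS [Non-SC: counts 2],
       LEN(@clef COLLATE Latin1_General_100_CI_AS_SC) AS [SC: counts 1];
```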

Also, there are some behavioral differences for non-Unicode / VARCHAR data only when using the SQL_ Collations:

  1. ASCII value 0 / CHAR(0) does not equate to an empty string.
  2. When sorting data using a case-sensitive Collation (e.g. SQL_Latin1_General_CP1_CS_AS), upper-case characters will sort ahead of lower-case characters.
  3. String-sort is used, which sorts all punctuation ahead of all letters, instead of Word-sort, which ignores dashes and apostrophes within words (e.g. a-f sorts before aa using String-sort, but after it when using Word-sort).
  4. No culture-specific character expansions are done (e.g. 'æ' is not treated as equal to 'ae'). A quick sketch of a couple of these differences follows this list.
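
Here is a rough sketch of points 1 and 3, using SQL_Latin1_General_CP1_CI_AS and Latin1_General_100_CI_AS as representative SQL_ and Windows Collations (run it on your own instance; exact behavior can vary by version and Collation):

```sql
-- Point 1: CHAR(0) vs. the empty string (VARCHAR data).
SELECT CASE WHEN CHAR(0) = '' COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [SQL_ Collation],
       CASE WHEN CHAR(0) = '' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [Windows Collation];

-- Point 3: String-sort vs. Word-sort ('a-f' vs. 'aa').
SELECT val AS [String-sort: 'a-f' first]
FROM   (VALUES ('a-f'), ('aa')) AS x(val)
ORDER BY val COLLATE SQL_Latin1_General_CP1_CI_AS;

SELECT val AS [Word-sort: 'aa' first]
FROM   (VALUES ('a-f'), ('aa')) AS x(val)
ORDER BY val COLLATE Latin1_General_100_CI_AS;
```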

There is, however, behavioral consistency between NVARCHAR data using any collation and VARCHAR data using a Windows Collation.

Hence, ideally, the SQL_ Collations shouldn't be used, given the above restrictions and differences, not to mention that they are also deprecated (and there are only 77 of them, compared to 3810 Windows Collations, as of SQL Server 2014). If anything, try to use the most recent version of a particular Collation (e.g. _100_), and if offered, use one ending in _SC.

Unfortunately, SQL_Latin1_General_CP1_CI_AS is the default when installing a new instance in the US (at least). But one should not willingly pick a SQL_ Collation for new development, especially when needing to deal with multiple cultures.
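
To see what your own instance defaulted to and which Collations are available, something along these lines should do (the name filters are only illustrative):

```sql
-- What did this instance get as its default Collation at install time?
SELECT SERVERPROPERTY('Collation') AS [Instance Default Collation];

-- How many SQL_ vs. Windows Collations are available here?
SELECT SUM(CASE WHEN name LIKE 'SQL[_]%' THEN 1 ELSE 0 END)     AS [SQL_ Collations],
       SUM(CASE WHEN name NOT LIKE 'SQL[_]%' THEN 1 ELSE 0 END) AS [Windows Collations]
FROM   sys.fn_helpcollations();

-- Version-100 Latin1_General Collations that support Supplementary Characters.
SELECT name, description
FROM   sys.fn_helpcollations()
WHERE  name LIKE 'Latin1[_]General[_]100[_]%[_]SC'
ORDER BY name;
```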

But to answer the 4 additional questions:

  1. How is Japanese_CI_AS different from SQL_Latin1_General_CI_AS when it comes to sorting non-Japanese characters?

    This is with regard to NVARCHAR data only, right? The LCID determines which culture-specific overrides get applied on top of the default sort options. I would suspect that US-English characters sort the same between the Japanese and Latin1 Collations, but I am not sure that holds true for other languages that use those characters, or for characters not found in US-English, such as letters with accents. An added complication is that accented letters exist both as single, precomposed characters and as base letters followed by separate combining characters, and those two forms might not equate the same way across the English / German / Japanese / Chinese locales (a probe you can run yourself is sketched after these answers).

  2. Does UCA sort differently than Japanese_CI_AS and SQL_Latin1_General_CI_AS?

    I am not sure this question makes sense. There is a default sort order given to all characters. Then each locale can override (either replace or remove) any of those defaults, or add new rules. So the UCA provides the base weights for characters, and each culture can then deviate from those defaults. Hence there will be a large amount of overlap in the rules, but also a large amount of variation between them.

  3. What is standard industry practice for systems used globally? (For example, what does Facebook, Twitter, Google, Flickr, Baidu or Microsoft, IBM, and SAP use?)

    Not sure what those companies do exactly, but I doubt they index their data with culture-specific linguistic rules, at least not across ALL of it. Most sites do ask for your preferred language and might use that to handle some of the data. Either way, there is no way to have a single, truly culture-independent sort order.

  4. Does SQL_Latin1_General_CI_AS define a sort order for Japanese characters? If not, how/why does Japanese text work in SQL_Latin1_General_CI_AS?

    Not sure what is meant by Japanese text "working", but Unicode is a single character set covering all languages, so the ability to store the characters of a particular language says nothing about the rules by which those characters will be sorted.

    As mentioned above, the UCA provides a default sort order for all characters. The Latin1 Collations can do basic sorting across all languages (in terms of Unicode / NVARCHAR data), but only using those default rules: they wouldn't apply any Japanese-specific (or other culture-specific) rules, of which there can be multiple sets. Also, as stated above, the SQL_ Collations do not have the ability to turn on Kana or Width sensitivity, which you might need. (A short demo of storing and sorting Japanese text under a Latin1 Collation follows these answers.)
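
To tie questions 1 and 4 back to something you can run, here are two small sketches (the Collation names and test strings are just examples; the idea is to observe the behavior on your own instance rather than to take my word for specific results):

```sql
-- Question 1: does a precomposed letter equate to its base letter + combining accent,
-- and does the answer (or the sort order below) change between the two Collations?
DECLARE @precomposed NVARCHAR(5) = NCHAR(252),         -- U+00FC, 'ü'
        @combining   NVARCHAR(5) = N'u' + NCHAR(776);  -- 'u' + U+0308 combining diaeresis

SELECT CASE WHEN @precomposed = @combining COLLATE Japanese_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [Japanese_CI_AS],
       CASE WHEN @precomposed = @combining COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [SQL_Latin1_General_CP1_CI_AS];

-- Compare how a few accented strings sort under each Collation.
SELECT val
FROM   (VALUES (N'ol'), (N'öl'), (N'Österreich'), (N'zebra')) AS x(val)
ORDER BY val COLLATE Japanese_CI_AS;
-- Re-run the ORDER BY with COLLATE SQL_Latin1_General_CP1_CI_AS and compare.
```

And for question 4, the storage side of it: NVARCHAR keeps the Japanese characters no matter which Collation is attached (they just sort by the default rules), while VARCHAR converts them through the Collation's Code Page (1252 for the CP1 Collations), which cannot represent them:

```sql
-- The table and values here are made up purely for illustration.
DECLARE @demo TABLE
(
    uni     NVARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS,
    non_uni VARCHAR(20)  COLLATE SQL_Latin1_General_CP1_CI_AS
);

INSERT INTO @demo (uni, non_uni) VALUES (N'日本語', N'日本語');

SELECT uni, non_uni FROM @demo;
-- uni = 日本語 (intact), non_uni = ??? (lost in the Code Page conversion)
```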