What would be some really good reasons to not use the SQL_Latin1_General_CI_AS collation on a system which handles English, German, Japanese, and Chinese data?
I'm having a really hard time finding good sources that compare collations and answer my question above, as well as the following:
- How is Japanese_CI_AS different from SQL_Latin1_General_CI_AS when it comes to sorting non-Japanese characters?
- Does UCA sort differently than Japanese_CI_AS and SQL_Latin1_General_CI_AS?
- What is standard industry practice for systems used globally? (For example, what do Facebook, Twitter, Google, Flickr, and Baidu use? Or Microsoft, IBM, and SAP?)
- Does SQL_Latin1_General_CI_AS define a sort order for Japanese characters? If not, how/why does Japanese text work in SQL_Latin1_General_CI_AS?
Basically I'm trying to learn how to pick the appropriate collation 🙂
Thanks in advance!
Best Answer
Collations in SQL Server handle several aspects of string data:

**Locale / LCID** (referring to the Culture: en-US, fr-FR, etc.)

This is used to determine culture-specific overrides to the default linguistic sorting and comparison rules used by Unicode / `NVARCHAR` data across all Collations, as well as by non-Unicode / `VARCHAR` data for Windows (i.e. non-`SQL_`) Collations.

**Code Page**

This is the character set used for non-Unicode / `VARCHAR` data across all Collations. To be clear, Code Pages do not apply to Unicode / `NVARCHAR` data, as Unicode is a single character set. And to be super clear, Unicode is a single character set regardless of how it is encoded: UTF-8, UTF-16, or UTF-32.

**Sensitivity**
Case and Accent sensitivity can be controlled across all Collations. Kana and Width sensitivity can only be controlled when using the Windows Collations, and are assumed to be "insensitive" when using the `SQL_` Collations.

Also, all of the Windows Collations should have a binary option (at least the deprecated `_BIN`, if not also `_BIN2`), whereas there are only two `SQL_` Collations that have the `_BIN` / `_BIN2` options: `SQL_Latin1_General_CP850` and `SQL_Latin1_General_CP437`.

**Ability to handle Supplementary Characters**
A set of Collations with names ending in `_SC` was added in SQL Server 2012. These allow for proper sorting / comparison, as well as handling by built-in functions, of UTF-16 Surrogate Pairs (which is how UTF-16 encodes Code Points above U+FFFF). This option is not available to any of the `SQL_` Collations.

Please note that regardless of Collation, all UTF-16 data can be stored and retrieved without any data loss in `NVARCHAR` / `NCHAR` / `XML` columns and variables, even if the Collation does not allow for properly interpreting Supplementary Characters.

Also, there are some behavioral differences for non-Unicode / `VARCHAR` data only when using the `SQL_` Collations:

- `CHAR(0)` does not equate to an empty string.
does not equate to an empty string.SQL_Latin1_General_CP1_CS_AS
), upper-case characters will sort ahead of lower-case characters.a-f
sorts beforeaa
using String-sort, but after it when using Word-sort).'æ' = 'ae'
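A quick way to see a couple of these `VARCHAR`-only differences side by side is to force the comparison collation with `COLLATE`. This is just a sketch; `Latin1_General_100_CI_AS` is one reasonable Windows Collation to compare against, not the only choice:

```sql
-- CHAR(0) vs. an empty string, and the 'æ' = 'ae' expansion, compared
-- under a Windows Collation and a SQL_ Collation (VARCHAR data only).
SELECT CASE WHEN CHAR(0) = '' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [NUL_Windows],
       CASE WHEN CHAR(0) = '' COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [NUL_SQL],
       CASE WHEN 'æ' = 'ae' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [Expansion_Windows],
       CASE WHEN 'æ' = 'ae' COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [Expansion_SQL];
```

Per the bullets above, the Windows Collation columns should come back "equal" and the `SQL_` Collation columns "not equal".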
There is, however, behavioral consistency between `NVARCHAR` data using any Collation and `VARCHAR` data using a Windows Collation.

Hence, ideally, the `SQL_` Collations shouldn't be used given the above restrictions and differences, not to mention that they are also deprecated (and there are only 77 of them, versus 3810 Windows Collations, as of SQL Server 2014). If anything, try to use the most recent version of a particular Collation (e.g. `_100_`), and if offered, use one ending in `_SC`.

Unfortunately, `SQL_Latin1_General_CP1_CI_AS` is the default when installing a new instance in the US (at least). But one should not willingly pick a `SQL_` Collation for new development, especially when needing to deal with multiple cultures.

But to answer the 4 additional questions:
1. This is with regards to `NVARCHAR` data only, right? The LCID determines which culture-specific overrides to apply to the default sort options. I would suspect that US-English characters will sort the same between the Japanese and Latin1 Collations, but I am not sure if that holds true for other languages that use those characters as well, or for characters not found in US-English, such as letters with accents. An added complication is that you have both letters that already include accents, and letters without accents combined with just the accents (i.e. combining characters), and those might not equate the same across the English / German / Japanese / Chinese locales.

2. I am not sure this question makes sense. There is a default sort order given to all characters, and each locale can then override (either replace or remove) any of those defaults, or add new rules. So the UCA provides the base weights for characters, but each culture can deviate from those defaults. Hence there will be a large amount of overlap in the rules, but also a large amount of variation between them.
3. Not sure what those companies do exactly, but I doubt the data is pre-indexed with culture-specific linguistic rules, at least not across ALL data. Most sites do ask for your preferred language and might use that to handle some of the data. Either way, there is no way to have a single, truly culture-independent sorting.
4. Not sure what is meant by Japanese text "working", but Unicode is a single character set for all languages, so the ability to store the characters of a particular language does not imply the rules by which those characters will be sorted. As mentioned above, the UCA is a default sort order for all characters. The Latin1 Collations can do basic sorting across all languages (in terms of Unicode / `NVARCHAR` data), but they would only have the default rules; the Latin1 Collations wouldn't have any culture-specific rules, and there can be multiple sets of those rules. Also, as stated above, the `SQL_` Collations do not have the ability to turn on Kana or Width sensitivity, which you might need.
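To illustrate that last point, here is a small sketch of Kana sensitivity using the `Japanese_CI_AS` Collation from the question and its Kana-sensitive variant, `Japanese_CI_AS_KS`:

```sql
-- Hiragana あ vs. Katakana ア: a Kana-insensitive Collation treats them
-- as equal, while the _KS (Kana-sensitive) variant distinguishes them.
SELECT CASE WHEN N'あ' = N'ア' COLLATE Japanese_CI_AS
            THEN 'equal' ELSE 'not equal' END AS [KanaInsensitive],
       CASE WHEN N'あ' = N'ア' COLLATE Japanese_CI_AS_KS
            THEN 'equal' ELSE 'not equal' END AS [KanaSensitive];
```

Since no `SQL_` Collation offers a `_KS` (or `_WS`) option, this distinction simply isn't available with them.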