Which SQL Server collation should I use for biblical Hebrew? The database under consideration needs to accommodate diacritics (i.e., vowels, accents, trope, etc.).
Sql-server – Which collation should I use for biblical Hebrew
collationconfigurationdatabase-designsql serversql-server-2017
Related Question
- Sql-server – Server default collation change or not change
- SQL Server Collation – Using DB Collation for Comparison
- SQL Server Collation – What Collation to Use for Ukraine?
- Understanding Collation Compatibility_60_406_30001 in SQL Server
- Sql-server – Search for Arabic text ignoring diacritics, alef hamza differences, and kashida in SQL Server and Oracle
- Database Collation – Case-Sensitive and Accent-Insensitive Options
- SQL Server Collation – Default vs Database Collation Issues
- Sql-server – Handle development issues when dealing with different collation for the instance/database setting
Best Answer
First: There is no distinction, collation-wise, between biblical Hebrew and modern Hebrew. We are just dealing with Hebrew.
Second:
Regardless of anything else, you want to use the newest set of collations, which are theNormally I highly recommend using the newest version of whatever collation you need, but at least in this one instance there is good reason to use the version with no version number in the name. The version 100 (or newer) collations are far more complete, and can distinguish between supplementary characters (or even fully support them if using an_100_
series as they have newer / more complete sort weights and linguistic rules than the older series with no version number in the name (technically the are version80
).SC
or140
collation), but assuming that you are not dealing with supplementary characters, then the version 80 (no version number in name) collations do a better job of handling Hebrew (see item "Sixth" below).Third: There is no concept of "Kana" (or Kana-type) in Hebrew, so you can ignore any of the collation variations with
_KS
in their names (as that is a sensitivity that you will never use).Forth: The collations ending in
_SC
support supplementary characters (i.e. full UTF-16) so usually best to pick one of those, if available (meaning: if you are using SQL Server 2012 or newer).Fifth: You don't want a binary collation (
_BIN
or_BIN2
) as those can't distinguish between Hebrew letters with both vowels and cantillation marks that are the same but have the combining characters in different orders, nor can they ignore vowels and other marks to equate things likeא
andאֽ
.For example (vowel and cantillation mark combining characters in opposite order):
Sixth: It depends on how you will be interacting with the string values. Hebrew does not have upper-case/lower-case letters, yet there are a few code points that are affected by case-sensitivity. There are even a few code points that are width-sensitive. Accent-sensitive / insensitive affects diacritics used for vowels, pronunciation, and cantillation marks (i.e. trope).
Do you need to distinguish between final and non-final forms of the same letter? There are five letters in Hebrew that look different when used as the final letter of a word. SQL Server handles this via case-sensitivity /
_CS
collations (though, unfortunately, it appears to be broken in the newer, and usually better, version 100 and newer collations):Do you need to distinguish between pronunciation marks, vowels, and cantillation marks? SQL Server handles this via accent-sensitivity /
_AS
collations (though, unfortunately, it appears to be broken in the newer, and usually better, version 100 and newer collations). Please note that all three of these are grouped together under accent-sensitivity and cannot be controlled separately (i.e. you can't do vowel sensitive but cantillation mark insensitive).Pronunciation marks
There are several letters that have two different sounds. Sometimes the only indicator for which sound to use is the context of what word the letter is in (and sometimes even the surrounding words), such as in the actual Torah (where there are no pronunciation marks or vowels). But, that same text in other forms, as well as other text, will have dots placed either inside the letter, or for the letter Shin, above the letter. The letter Shin can have either an "sh" or "s" sound. To indicate the "sh" sound (i.e. letter "shin"), there is a dot above on the right side, while a dot above on the left side denotes the "s" sound (letter "sin"):
Vowels
Cantilation marks
Technically, according to official Unicode data, Hebrew cantillation marks are ignorable and should only register as a difference here when using a binary collation. However, SQL Server treats them the same as accents (unfortunately) and cannot ignore them separately from pronunciation marks or vowels.
Do you need to distinguish between wide and non-wide forms of the same letter? There are eight letters in Hebrew that are stretched (wide) but only for the purpose of using in Torah scrolls (either hand-written / real or printed) to keep the fully-justified columnar format (which is actually how it appears in Torah scrolls). SQL Server handles this via width-sensitivity /
_WS
collations (interestingly-enough, it appears to be the only sensitivity that works correctly in the newer version 100 and newer collations, though sadly, it's the least likely to be used):So, perhaps
Hebrew_CI_AI
for the columns, and you can override per expression / predicate via theCOLLATE
statement if you need to use a variation, such asCOLLATE Hebrew_CS_AI
orHebrew_CI_AS
orHebrew_CS_AS
.Additional Notes
You will need to store the data in
NVARCHAR
columns / variables. You could accomplish most of this in regular 8-bitVARCHAR
using the Windows-1255 code page (which is what allHebrew_*
collations use), including combining characters for vowels and pronunciation dots:However, only the Unicode Hebrew block contains the cantillation marks (i.e. trope; code points U+0591 through U+05AF) plus a few extras (code points U+05C4 through U+05C7), and the Alphabetic Presentation Forms block contains the wide variant of several letters plus some other stuff.
According to the official Unicode CLDR (locale-specific tailorings) rules for the Hebrew ("he" and "he-IL") culture, U+05F3 HEBREW PUNCTUATION GERESH should either match or come before, U+0027 APOSTROPHE. Normally, U+05F3 sorts after apostrophe. This behavior is indeed seen when using the ICU collation demo and switching between the "root" / standard sort order (used by US English / "en-US") and "he". However, this behavior does not seem to be available in either .NET or SQL Server:
While an unfortunate omission, it does make sense given that I don't see any "he" or "he-IL" -specific tailorings in the Windows Sorting Weight Table files. This could very well mean that there is no actual difference between the
Hebrew_*
andLatin1_General_*
collations outside of the associated code page, which is only used forVARCHAR
data, and that does not apply here.O.P. replied with:
In that case, since you don't need to ignore differences between those properties, you can use the 100 level collations. The example below shows a Hebrew letter (Sin) with a pronunciation mark, a cantillation mark, and a vowel. There are six versions so that each possible combination of ordering of the combining characters could be represented. There is a seventh entry that uses the other dot to create the letter Shin with the same base letter, vowel, and cantillation mark. The query shows that only the six "Sin" entries match each other (even with different byte orders), but not the "Shin".
I included the use of
Latin1_General
andJapanese_XJIS_140
collations to show that the rules work with those as well in case you need to make use of them (the140
collations, only in Japanese, have more upper-case / lower-case mappings than older versions). But in general, probably best to stick with the Hebrew collations, and use a non-100
version if you need to ignore differences in vowels, marks, dots, and final vs non-final forms.