Sql-server – Which collation should I use for biblical Hebrew

collationconfigurationdatabase-designsql serversql-server-2017

Which SQL Server collation should I use for biblical Hebrew? The database under consideration needs to accommodate diacritics (i.e., vowels, accents, trope, etc.).

Best Answer

First: There is no distinction, collation-wise, between biblical Hebrew and modern Hebrew. We are just dealing with Hebrew.

Second: Regardless of anything else, you want to use the newest set of collations, which are the _100_ series as they have newer / more complete sort weights and linguistic rules than the older series with no version number in the name (technically the are version 80). Normally I highly recommend using the newest version of whatever collation you need, but at least in this one instance there is good reason to use the version with no version number in the name. The version 100 (or newer) collations are far more complete, and can distinguish between supplementary characters (or even fully support them if using an SC or 140 collation), but assuming that you are not dealing with supplementary characters, then the version 80 (no version number in name) collations do a better job of handling Hebrew (see item "Sixth" below).

Third: There is no concept of "Kana" (or Kana-type) in Hebrew, so you can ignore any of the collation variations with _KS in their names (as that is a sensitivity that you will never use).

Forth: The collations ending in _SC support supplementary characters (i.e. full UTF-16) so usually best to pick one of those, if available (meaning: if you are using SQL Server 2012 or newer).

Fifth: You don't want a binary collation (_BIN or _BIN2) as those can't distinguish between Hebrew letters with both vowels and cantillation marks that are the same but have the combining characters in different orders, nor can they ignore vowels and other marks to equate things like א and אֽ.

For example (vowel and cantillation mark combining characters in opposite order):

SELECT NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8),
       NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C)
WHERE  NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8) =
       NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C) COLLATE Hebrew_100_CS_AS_SC;
-- אָ֜  אָ֜

SELECT NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8),
       NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C)
WHERE  NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8) =
       NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C) COLLATE Hebrew_100_BIN2;
-- no rows

Sixth: It depends on how you will be interacting with the string values. Hebrew does not have upper-case/lower-case letters, yet there are a few code points that are affected by case-sensitivity. There are even a few code points that are width-sensitive. Accent-sensitive / insensitive affects diacritics used for vowels, pronunciation, and cantillation marks (i.e. trope).

  1. Do you need to distinguish between final and non-final forms of the same letter? There are five letters in Hebrew that look different when used as the final letter of a word. SQL Server handles this via case-sensitivity / _CS collations (though, unfortunately, it appears to be broken in the newer, and usually better, version 100 and newer collations):

    SELECT NCHAR(0x05DE) AS [Mem],
           NCHAR(0x05DD) AS [Final Mem]
    WHERE  NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_CI_AS_KS_WS;
    -- 1 row (expected; all sensitive except case)
    -- Mem  Final Mem
    -- ‭מ    ם
    
    
    SELECT NCHAR(0x05DE) AS [Mem],
           NCHAR(0x05DD) AS [Final Mem]
    WHERE  NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_CS_AI;
    -- no rows (expected; all insensitive except case)
    
    
    SELECT NCHAR(0x05DE) AS [Mem],
           NCHAR(0x05DD) AS [Final Mem]
    WHERE  NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_100_CI_AI;
    -- no rows (expected 1 row; all insensitive)
    
  2. Do you need to distinguish between pronunciation marks, vowels, and cantillation marks? SQL Server handles this via accent-sensitivity / _AS collations (though, unfortunately, it appears to be broken in the newer, and usually better, version 100 and newer collations). Please note that all three of these are grouped together under accent-sensitivity and cannot be controlled separately (i.e. you can't do vowel sensitive but cantillation mark insensitive).

    Pronunciation marks

    There are several letters that have two different sounds. Sometimes the only indicator for which sound to use is the context of what word the letter is in (and sometimes even the surrounding words), such as in the actual Torah (where there are no pronunciation marks or vowels). But, that same text in other forms, as well as other text, will have dots placed either inside the letter, or for the letter Shin, above the letter. The letter Shin can have either an "sh" or "s" sound. To indicate the "sh" sound (i.e. letter "shin"), there is a dot above on the right side, while a dot above on the left side denotes the "s" sound (letter "sin"):

    SELECT NCHAR(0x05E9) AS [Shin],                            -- ש
           NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
           NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot]   -- שׂ
    WHERE  NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_CS_AI_KS_WS
    AND    NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C2) COLLATE Hebrew_CS_AI_KS_WS;
    -- 1 row (expected; all sensitive except accent)
    
    
    SELECT NCHAR(0x05E9) AS [Shin],                            -- ש
           NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
           NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot]   -- שׂ
    WHERE  NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_CI_AS
    OR     NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C2) COLLATE Hebrew_CI_AS;
    -- no rows (expected; all insensitive except accent)
    
    
    SELECT NCHAR(0x05E9) AS [Shin],                            -- ש
           NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
           NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot]   -- שׂ
    WHERE  NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_100_CI_AI_SC
    OR     NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C2) COLLATE Hebrew_100_CI_AI_SC;
    -- no rows (expected 1 row; all insensitive)
    

    Vowels

    SELECT NCHAR(0x05D0) AS [Aleph],                           -- א
           NCHAR(0x05D0) + NCHAR(0x05B8) AS [Aleph with vowel] -- אָ
    WHERE  NCHAR(0x05D0) =
           NCHAR(0x05D0) + NCHAR(0x05B8) COLLATE Hebrew_CS_AI_KS_WS;
    -- 1 row (expected; all sensitive except accent)
    
    
    SELECT NCHAR(0x05D0) AS [Aleph],                           -- א
           NCHAR(0x05D0) + NCHAR(0x05B8) AS [Aleph with vowel] -- אָ
    WHERE  NCHAR(0x05D0) =
           NCHAR(0x05D0) + NCHAR(0x05B8) COLLATE Hebrew_CI_AS;
    -- no rows (expected; all insensitive except accent)
    
    
    SELECT NCHAR(0x05D0) AS [Aleph],                           -- א
           NCHAR(0x05D0) + NCHAR(0x05B8) AS [Aleph with vowel] -- אָ
    WHERE  NCHAR(0x05D0) =
           NCHAR(0x05D0) + NCHAR(0x05B8) COLLATE Hebrew_100_CI_AI_SC;
    -- no rows (expected 1 row; all insensitive)
    

    Cantilation marks

    Technically, according to official Unicode data, Hebrew cantillation marks are ignorable and should only register as a difference here when using a binary collation. However, SQL Server treats them the same as accents (unfortunately) and cannot ignore them separately from pronunciation marks or vowels.

    SELECT NCHAR(0x05D0) AS [Aleph],                                       -- א
           NCHAR(0x05D0) + NCHAR(0x05A8) AS [Aleph with cantillation mark] -- א֨
    WHERE  NCHAR(0x05D0) =
           NCHAR(0x05D0) + NCHAR(0x05A8) COLLATE Hebrew_CS_AI_KS_WS;
    -- 1 row (expected; all sensitive except accent)
    
    
    SELECT NCHAR(0x05D0) AS [Aleph],                                       -- א
           NCHAR(0x05D0) + NCHAR(0x05A8) AS [Aleph with cantillation mark] -- א֨
    WHERE  NCHAR(0x05D0) =
           NCHAR(0x05D0) + NCHAR(0x05A8) COLLATE Hebrew_CI_AS;
    -- no rows (expected; all insensitive except accent)
    
    
    SELECT NCHAR(0x05D0) AS [Aleph],                                       -- א
           NCHAR(0x05D0) + NCHAR(0x05A8) AS [Aleph with cantillation mark] -- א֨
    WHERE  NCHAR(0x05D0) =
           NCHAR(0x05D0) + NCHAR(0x05A8) COLLATE Hebrew_100_CI_AI_SC;
    -- no rows (expected 1 row; all insensitive)
    
  3. Do you need to distinguish between wide and non-wide forms of the same letter? There are eight letters in Hebrew that are stretched (wide) but only for the purpose of using in Torah scrolls (either hand-written / real or printed) to keep the fully-justified columnar format (which is actually how it appears in Torah scrolls). SQL Server handles this via width-sensitivity / _WS collations (interestingly-enough, it appears to be the only sensitivity that works correctly in the newer version 100 and newer collations, though sadly, it's the least likely to be used):

    SELECT NCHAR(0x05DC) AS [Lamed],
           NCHAR(0xFB25) AS [Wide Lamed]
    WHERE  NCHAR(0x05DC) = NCHAR(0xFB25) COLLATE Hebrew_CI_AI;
    -- no rows (expected 1 row; all insensitive)
    
    
    SELECT NCHAR(0x05DC) AS [Lamed],
           NCHAR(0xFB25) AS [Wide Lamed]
    WHERE  NCHAR(0x05DC) = NCHAR(0xFB25) COLLATE Hebrew_100_CS_AS_KS_SC;
    -- 1 row (expected; all sensitive except width)
    -- Lamed    Wide Lamed
    -- ‭ל        ﬥ
    
    
    SELECT NCHAR(0x05DC) AS [Lamed],
           NCHAR(0xFB25) AS [Wide Lamed]
    WHERE  NCHAR(0x05DC) = NCHAR(0xFB25) COLLATE Hebrew_100_CI_AI_WS_SC;
    -- no rows (expected; all insensitive except width)
    

So, perhaps Hebrew_CI_AI for the columns, and you can override per expression / predicate via the COLLATE statement if you need to use a variation, such as COLLATE Hebrew_CS_AI or Hebrew_CI_AS or Hebrew_CS_AS.

Additional Notes

  1. You will need to store the data in NVARCHAR columns / variables. You could accomplish most of this in regular 8-bit VARCHAR using the Windows-1255 code page (which is what all Hebrew_* collations use), including combining characters for vowels and pronunciation dots:

    ;WITH Hebrew AS
    (
        SELECT NCHAR(0x05E9) + NCHAR(0x05C1) + NCHAR(0x05B8)
               COLLATE Hebrew_100_CS_AS AS [Shin]
    )
    SELECT
      Hebrew.[Shin] AS [Unicode],
      CONVERT(VARCHAR(20), Hebrew.[Shin]) AS [CodePage1255],
      CONVERT(VARBINARY(10), CONVERT(VARCHAR(20), Hebrew.[Shin])) AS [CodePage1255_bytes]
    FROM   Hebrew;
    -- Unicode  CodePage1255    CodePage1255_bytes
    -- שָׁ        שָׁ               F9D1C8
    
    

    However, only the Unicode Hebrew block contains the cantillation marks (i.e. trope; code points U+0591 through U+05AF) plus a few extras (code points U+05C4 through U+05C7), and the Alphabetic Presentation Forms block contains the wide variant of several letters plus some other stuff.

  2. Please see the official Unicode Standard (Version 12.1), Chapter 9: Middle East-I, Section 9.1: Hebrew, pages 361-366 (389-394 of the PDF) for a more detailed description of how Unicode handles Hebrew text.
  3. According to the official Unicode CLDR (locale-specific tailorings) rules for the Hebrew ("he" and "he-IL") culture, U+05F3 HEBREW PUNCTUATION GERESH should either match or come before, U+0027 APOSTROPHE. Normally, U+05F3 sorts after apostrophe. This behavior is indeed seen when using the ICU collation demo and switching between the "root" / standard sort order (used by US English / "en-US") and "he". However, this behavior does not seem to be available in either .NET or SQL Server:

    SELECT NCHAR(0x05F3)
    WHERE  NCHAR(0x05F3) <= N'''' COLLATE Hebrew_100_CS_AS_KS_WS;
    -- no rows
    
    SELECT NCHAR(0x05F3)
    WHERE  NCHAR(0x05F3) <= N'''' COLLATE Hebrew_CS_AS_KS_WS;
    -- no rows
    

    While an unfortunate omission, it does make sense given that I don't see any "he" or "he-IL" -specific tailorings in the Windows Sorting Weight Table files. This could very well mean that there is no actual difference between the Hebrew_* and Latin1_General_* collations outside of the associated code page, which is only used for VARCHAR data, and that does not apply here.

  4. Please see related question and answer: How To Strip Hebrew Accent Marks

O.P. replied with:

Yes, I need to distinguish between: 1) final and non-final forms of the same letter 2) pronunciation marks 3) vowels, and 4) cantillation marks.

In that case, since you don't need to ignore differences between those properties, you can use the 100 level collations. The example below shows a Hebrew letter (Sin) with a pronunciation mark, a cantillation mark, and a vowel. There are six versions so that each possible combination of ordering of the combining characters could be represented. There is a seventh entry that uses the other dot to create the letter Shin with the same base letter, vowel, and cantillation mark. The query shows that only the six "Sin" entries match each other (even with different byte orders), but not the "Shin".

I included the use of Latin1_General and Japanese_XJIS_140 collations to show that the rules work with those as well in case you need to make use of them (the 140 collations, only in Japanese, have more upper-case / lower-case mappings than older versions). But in general, probably best to stick with the Hebrew collations, and use a non-100 version if you need to ignore differences in vowels, marks, dots, and final vs non-final forms.

DECLARE @Shin   NVARCHAR(5) = NCHAR(0x05E9), -- base Hebrew letter
        @Dot    NVARCHAR(5) = NCHAR(0x05C2), -- Sin Dot
        @Mark   NVARCHAR(5) = NCHAR(0x05A8), -- Cantillation Mark (i.e. trope)
        @Vowel  NVARCHAR(5) = NCHAR(0x05B8); -- Vowel

DECLARE @Dot_Mark_Vowel NVARCHAR(20) = @Shin + @Dot + @Mark + @Vowel,
        @Dot_Vowel_Mark NVARCHAR(20) = @Shin + @Dot + @Vowel + @Mark,
        @Vowel_Dot_Mark NVARCHAR(20) = @Shin + @Vowel + @Dot + @Mark,
        @Vowel_Mark_Dot NVARCHAR(20) = @Shin + @Vowel + @Mark + @Dot,
        @Mark_Vowel_Dot NVARCHAR(20) = @Shin + @Mark + @Vowel + @Dot,
        @Mark_Dot_Vowel NVARCHAR(20) = @Shin + @Mark + @Dot + @Vowel,
        @ShinDot_Mark_Vowel NVARCHAR(20) = @Shin + NCHAR(0x05C1) + @Mark + @Vowel;

SELECT @Dot_Mark_Vowel AS [Sin], @ShinDot_Mark_Vowel AS [Shin];

;WITH chr AS
(
  SELECT *
  FROM   (VALUES
            (@Dot_Mark_Vowel, 'Dot + Mark + Vowel'),
            (@Dot_Vowel_Mark, 'Dot + Vowel + Mark'),
            (@Vowel_Dot_Mark, 'Vowel + Dot + Mark'),
            (@Vowel_Mark_Dot, 'Vowel + Mark + Dot'),
            (@Mark_Vowel_Dot, 'Mark + Vowel + Dot'),
            (@Mark_Dot_Vowel, 'Mark + Dot + Vowel'),
            (@ShinDot_Mark_Vowel, 'ShinDot + Mark + Vowel')
         ) tmp([Hebrew], [Description])
) SELECT chr1.[Hebrew],
         '--' AS [---],
         chr1.[Description] AS [Description_1],
         CONVERT(VARBINARY(20), RIGHT(chr1.[Hebrew], 3)) AS [Bytes_1],
         '--' AS [---],
         chr2.[Description] AS [Description_2],
         CONVERT(VARBINARY(20), RIGHT(chr2.[Hebrew], 3)) AS [Bytes_2]
  FROM   chr chr1
  CROSS JOIN chr chr2
  WHERE  chr1.[Description] <> chr2.[Description] -- do not compare item to itself
  AND    chr1.[Hebrew] = chr2.[Hebrew] COLLATE Hebrew_100_CS_AS_SC
  AND    chr1.[Hebrew] = chr2.[Hebrew] COLLATE Latin1_General_100_CS_AS_SC
  AND    chr1.[Hebrew] = chr2.[Hebrew] COLLATE Japanese_XJIS_140_CS_AS;

-- this query returns 30 rows