Sort Code Reuse for Multilingual Database – SQL Server Collation Guide

collationsortingsql servert-sql

I'm attempting to sort a result from a table that is multilingual. I would like to have the sorting algorithm come from a function but then it adds a 8% hit to the performance, at least. So, I'm not quite sure how to go about it. So, for the sorting I'm using this method described in an article on how to sort multilingual tables as shown below:

select UnicodeData,Collation
from (
    select
        ML.UnicodeData,
        ML.Collation,
        RN =
            CASE
                when Collation = 'he-IL' then ROW_NUMBER() OVER (order by unicodedata Collate Hebrew_CI_AS                  )
                when Collation = 'en-US' then ROW_NUMBER() OVER (order by unicodedata Collate SQL_Latin1_General_CP1_CI_AS  )
                when Collation = 'kn-IN' then ROW_NUMBER() OVER (order by unicodedata Collate Indic_General_100_CI_AS       )
                when Collation = 'hi-IN' then ROW_NUMBER() OVER (order by unicodedata Collate Indic_General_100_CI_AS       )
                when Collation = 'ar-EG' then ROW_NUMBER() OVER (order by unicodedata Collate Arabic_CI_AS                  )
                when Collation = 'cs'    then ROW_NUMBER() OVER (order by unicodedata Collate Czech_CI_AS                   )
            END
    from MultipleLanguages ML
) T
order by RN

Except I abstracted out the collation code into its own function, like so:

CREATE FUNCTION [utils].[OrderByLanguage]
( @LanguageID tinyint
, @IDName utils.ID_Name READONLY
) RETURNS TABLE AS RETURN
SELECT
      t.ID
    , CASE @LanguageID
        WHEN 1 THEN ROW_NUMBER() OVER (ORDER BY t.[Name]) -- en
        WHEN 3 THEN ROW_NUMBER() OVER (ORDER BY t.[Name]) -- en-ca
        WHEN 6 THEN ROW_NUMBER() OVER (ORDER BY t.[Name]) -- 'en-nz'
        WHEN 5 THEN ROW_NUMBER() OVER (ORDER BY t.[Name]) -- 'en-za'
        WHEN 2 THEN ROW_NUMBER() OVER (ORDER BY t.[Name] COLLATE Modern_Spanish_CI_AI) -- es
        WHEN 4 THEN ROW_NUMBER() OVER (ORDER BY t.[Name] COLLATE French_CI_AI) -- 'fr-ca'
    END RowNumber
FROM @IDName t

But then when I call this function I have to do this awkward double call on the table-valued function.

CREATE FUNCTION api.GetTable
( @LanguageCode VARCHAR(10)
) RETURNS NVARCHAR(MAX)
AS BEGIN

    DECLARE
          @Result NVARCHAR(MAX)
        , @LangID tinyint
    DECLARE @Sort utils.ID_Name

    SET @LangID = api_utils.GetLanguageID(@LanguageCode)

    INSERT INTO @Sort (ID, [Name])
    SELECT
          t.ID
        , t.title
    FROM api_utils.GetTable(@LangID) t

    SET @Result = (
        SELECT
            CONVERT(VARCHAR(10), t.ID) id,
            t.category,
            t.[system],
            t.title,
            JSON_QUERY(utils.ToRawJsonArray((
                SELECT x.[Description]
                FROM api_utils.GetKeywords(t.ID, @LangID) x
                ORDER BY x.[Description]
                FOR JSON AUTO), 'Description')
            ) keywords
        FROM api_utils.GetTable(@LangID) t
        ORDER BY (SELECT s.RowNumber
                  FROM utils.OrderByLanguage(@LangID, @Sort) s
                  WHERE s.ID = t.ID)
        FOR JSON AUTO, ROOT('titles')
        )

    RETURN @Result

END

So, you can see that I have to call the function api_utils.GetTable twice. As far as I can tell the only other way to abstract out the collation sorting would be to put in the actual sorting algorithm and then have a script that searches all the code base and add in another collation language whenever I need to add another language. Is there any other way to do this? What have others done? What would be the best practice? The performance on this isn't absolutely critical but it is nice to keep it lean so doesn't take too long since it is already an intensive call.

Thanks in advance!

Update

Answering @srutzky's questions in comments:

1) How much data is returned by api_utils.GetTable?

There's about 150 records returned from the table.

2) Why call api_utils.GetTable twice when the first time the result is dumped into @Sort?

The @Sort table is a memory optimized user defined table (UDT). Since I'm passing a table into the utils.OrderByLanguage function it needs to be a UDT. Which means I need to get the data from the inline function api_utils.GetTable twice. I'm not sure if it caused a performance issue calling api_utils.GetTable twice. Maybe SQL Server is smart enough to cache the result? Testing it again the INSERT query cost is 38%. So, a fairly significant part of the query cost.

Wouldn't it be faster to add the category and system columns to @Sort and pull them back in the first call and then use @Sort in the FROM clause?

Since the UDT is generic for use across all the different procedures that call the function utils.OrderByLanguage It would make it very difficult to generalize for an unknown number of columns that the different procedures would use.

3) Does this have to be a function or can it be a stored procedure?

Are you talking about api_utils.GetTable? I would prefer that api_utils.GetTable stay a function as it makes it easier to use and test. Where I call api_utils.GetTable it is a Stored Procedure.

If you are talking about utils.OrderByLanguage I don't mind if it would be a stored procedure. I'm not sure how that would help though. So, if it would please let me know!

Update to Accepted Answer

I found adding a an index didn't make a difference performance wise. I also figured I might as well put the sort column in the original #sort table since it has to be the same anyways. This lowers the number of warnings in my SSDT project. Then I just do an alter on the column like so:

ALTER TABLE #AlterSort ALTER COLUMN [sort] nvarchar(max) COLLATE SQL_Latin1_General_CP1_CI_AS

Best Answer

Looking at what you have so far, it's good that [utils].[OrderByLanguage] is an Inline Table-Valued Function (ITVF), but it still appears to be a correlated subquery where for each row in api_utils.GetTable(@LangID) it passes in all rows in api_utils.GetTable(@LangID) to sort it (i.e. the ORDER BY clause.

Applying Collations at run-time can be quite costly since it has to generate the sort keys for those values in that moment. For the best performance, creating an index will generate the sort keys ahead of time and even put them in the proper order. But dealing with multiple locales is indeed tricky. Creating a copy of the source column per each locale can require a huge amount of additional disk space (and I/O), depending on the size of the strings, how many locales / Collations one needs, and how many rows there will be over the next 3 - 5 years. Fortunately, you can create non-persisted computed columns to be those copies (which don't take up any space) and index those (which does take up space). While this is likely not feasible if there were 10 locales, a base column of at least NVARCHAR(200), and 1 million (or more) rows, for your situation it should work just fine. With this approach, the dynamic portion will be in choosing which column to select from (achievable via Dynamic SQL, or IF statements, depending on the situation). But as you can see with the following example (turn on "Include Actual Execution Plan"), both filtered queries (last 2 queries) get Index Seeks on the intended indexes and return the expected results:

You can see a live demo of this on dbfiddle.uk.

SET NOCOUNT ON;
-- DROP TABLE #T;
CREATE TABLE #T
(
  [ID] INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
  [Col1] VARCHAR(10) COLLATE Latin1_General_100_CI_AS,
  [Col2] AS ([Col1] COLLATE Latin1_General_100_CS_AS)
);

CREATE INDEX [IX_#T_Col1] ON [#T] ([Col1]);
CREATE INDEX [IX_#T_Col2] ON [#T] ([Col2]);

INSERT INTO #T ([Col1]) VALUES ('a'), ('A');

SELECT * FROM #T; -- 2 rows

SELECT [ID] FROM #T WHERE [Col1] = 'A'; -- 2 rows

SELECT [ID] FROM #T WHERE [Col2] = 'A'; -- 1 row

However, given the code for api.GetTable this might not be the best approach. If you wanted to keep the current structure (as much as possible), then you might could do the following:

  1. Convert api.GetTable into a Stored Procedure
  2. Use an OUTPUT parameter so you don't need to deal with a result set
  3. @Sort should be a temporary table, not a table variable
  4. Create the #Sort temporary table with all of the columns it will need except for [Name] / title): ID, category, and [system].
  5. In a series of IF statements, add the [title] column, but with the proper Collation:

    IF (@LanguageID = 2)
    BEGIN
      ALTER TABLE #Sort ADD [title] NVARCHAR(2000) COLLATE Modern_Spanish_CI_AI
    END;
    
    IF (@LanguageID = 4)
    BEGIN
      ALTER TABLE #Sort ADD [title] NVARCHAR(2000) COLLATE French_CI_AI
    END;
    
    ...
    
  6. Change FROM api_utils.GetTable(@LangID) t in the main @Result= query to be: FROM #Sort t
  7. Change ORDER BY (SELECT s.RowNumber... in the main @Result= query to be: ORDER BY t.[title]

This will require re-applying the Collation at each execution, but:

  1. it does not require any changes to base tables
  2. it will re-use the data returned by api.GetTable (no double call)
  3. it will use statistics on the [Name] column
  4. it won't be a correlated sub-query (woo hoo!)
  5. it gives you the option of creating an index on the [Name] column after it has been populated.

Code Reuse

By switching over to stored procedures and temporary tables we can actually keep with the goal of code reuse. The code to add the sort column to the local temp table can be abstracted to another stored procedure. While creating a temporary table in a sub-procedure call does not help as that temp table will disappear once that sub-procedure call ends, changes made to a temporary table that exists prior to the sub-procedure call will survive the completion of that call. For example:

Setup

CREATE PROCEDURE #AddSortColumn
(
  @LanguageID TINYINT
)
AS
SET NOCOUNT ON;

DECLARE @IsColumnAdded BIT = 0;

IF (@LanguageID = 2)
BEGIN
  ALTER TABLE #Sort ADD [title] NVARCHAR(2000) COLLATE Modern_Spanish_CI_AI;
  SET @IsColumnAdded = 1;
END;

IF (@LanguageID = 4)
BEGIN
  ALTER TABLE #Sort ADD [title] NVARCHAR(2000) COLLATE French_CI_AI;
  SET @IsColumnAdded = 1;
END;

IF (@IsColumnAdded = 0)
BEGIN
  RAISERROR(N'Invalid @LanguageID: %d', 16, 1, @LanguageID);
  RETURN;
END;
GO

Test

CREATE TABLE #Sort (ID INT);

EXEC tempdb.dbo.sp_help '#Sort'; -- only 1 column

EXEC #AddSortColumn @LanguageID = 4;

EXEC tempdb.dbo.sp_help '#Sort'; -- now there are 2 columns

By placing the specific Collations only in this one Stored Procedure, you should have only this one place to update when adding a new language to support.

Putting it all together

With all of the above in mind, we end up with the following stored procedure containing a single call to api_utils.GetTable(), no correlated sub-query, and api_utils.AddSortColumn (which you need to create) can be used in other stored procedures and against other base tables:

CREATE PROCEDURE api.GetTableAsJSON
(
  @LanguageCode VARCHAR(10),
  @JSONifiedTable NVARCHAR(MAX) OUTPUT
)
AS
SET NOCOUNT ON;

    DECLARE @LangID TINYINT;

    CREATE TABLE #Sort
    (
      [ID] INT NOT NULL PRIMARY KEY,
      [category] ...,
      [system] ...
    );

    SET @LangID = api_utils.GetLanguageID(@LanguageCode)

    EXEC api_utils.AddSortColumn @LanguageID = @LangID; -- add [Title] to #Sort

    INSERT INTO #Sort ([ID], [category], [system], [Title])
      SELECT t.[ID], t.[category], t.[system], t.[title]
      FROM   api_utils.GetTable(@LangID) t

    -- Optional index; try with and without to see which is better.
    --CREATE INDEX [IX_#Sort_Title] ON #Sort ([Title]);

    SET @JSONifiedTable = (
        SELECT
            CONVERT(VARCHAR(10), t.ID) AS [id],
            t.[category],
            t.[system],
            t.title,
            JSON_QUERY(utils.ToRawJsonArray((
                SELECT x.[Description]
                FROM api_utils.GetKeywords(t.[ID], @LangID) x
                ORDER BY x.[Description]
                FOR JSON AUTO), 'Description')
            ) AS [keywords]
        FROM #Sort t
        ORDER BY t.[Title]
        FOR JSON AUTO, ROOT('titles')
        );

    RETURN;
END;