It is possible to store strings with different collations in the same column using SQL_VARIANT:
CREATE TABLE dbo.Localized
(
text_id INTEGER NOT NULL,
lang_id INTEGER NOT NULL,
text_body SQL_VARIANT NOT NULL,
CONSTRAINT [PK dbo.Localized text_id, lang_id]
PRIMARY KEY CLUSTERED (text_id, lang_id),
)
GO
INSERT dbo.Localized
(text_id, lang_id, text_body)
VALUES
(1001, 2057, N'Database problems' COLLATE Latin1_General_CI_AS);
GO
INSERT dbo.Localized
(text_id, lang_id, text_body)
VALUES
(1001, 1025, N'قاعدة بيانات المشاكل' COLLATE Arabic_CI_AS)
This design has several drawbacks (including being limited to 8000 bytes), not least in the search area: SQL_VARIANT
cannot be full-text indexed, and some string comparison features (e.g. LIKE
) cannot be used directly either. On the other hand, it is possible to create a regular index on SQL_VARIANT
and perform the more basic comparisons (e.g. <, =, >) in a collation-aware fashion:
CREATE UNIQUE INDEX uq1 ON dbo.Localized (text_body)
GO
-- One row
SELECT
l.*
FROM dbo.Localized AS l
WHERE
l.text_body = CONVERT(SQL_VARIANT, N'Database problems' COLLATE Latin1_General_CI_AS)
-- No rows (and no collation error!)
SELECT
l.*
FROM dbo.Localized AS l
WHERE
l.text_body = CONVERT(SQL_VARIANT, N'Database problems' COLLATE Arabic_CI_AS)
-- One row, index seek, manual version of "LIKE 'D%'"
SELECT
l.*
FROM dbo.Localized AS l
WHERE
l.text_body >= CONVERT(SQL_VARIANT, N'D' COLLATE Latin1_General_CI_AS)
AND l.text_body < CONVERT(SQL_VARIANT, N'E' COLLATE Latin1_General_CI_AS)
We can also write the usual sort of procedures:
CREATE PROCEDURE dbo.GetLocalizedString
@text_id INTEGER,
@lang_id INTEGER,
@text_body SQL_VARIANT OUTPUT
AS
BEGIN
SELECT
@text_body = l.text_body
FROM dbo.Localized AS l
WHERE
l.text_id = @text_id
AND l.lang_id = @lang_id
END
GO
DECLARE @text SQL_VARIANT
EXECUTE dbo.GetLocalizedString
@text_id = 1001,
@lang_id = 1025,
@text_body = @text OUTPUT
SELECT @text
Of course, full-text indexing is also problematic in the "single table for all translations" design, since full-text indexing (all but) requires a language id setting per column. The multiple table design described by Joop Eggen could be full-text indexed (though it would naturally require one index per table).
The other main option is to have one column per locale in the base table:
CREATE TABLE dbo.Example
(
text_id INTEGER NOT NULL,
text_2057 NVARCHAR(MAX) COLLATE Latin1_General_CI_AS NULL,
text_1025 NVARCHAR(MAX) COLLATE Arabic_CI_AS NULL,
CONSTRAINT [PK dbo.Example text_id]
PRIMARY KEY CLUSTERED (text_id)
)
This arrangement does have a certain simplicity to it, and works well with full-text indexing, though it does require a new column to be added with each new language, and many developers find this sort of structure inelegant and unsatisfactory to work with.
Each of the alternatives has advantages and disadvantages, and will require indirection at some level or another, so it might depend on where the developers concerned feel happiest locating that indirection. I imagine most people will prefer the multiple-table design for most purposes.
We've done a lot of this, and (administrative) users were allowed to fix the translations live. (You still might want a caching layer, but I'm totally down with driving this with a real database and not resource files - it gives you a ton of power to query and find things which need to be translated, etc). I think your schema is probably fine, so I'll just pass on some stuff we learned in the hope that it's useful.
One thing you have left out is phrases with insertion points. In the example below, the order is reversed and the language is still English, but this could very easily be two different languages - pretend this is just two languages who normally put things in a different order.
Hello, <username> you have <x> points!
You've got <x> points to spend, <username>!
In our pre-.NET, we had a routine which did the insertion so the phrases would look like this:
Hello, {0:username} you have {1:points} points!
You've got {1:points} points to spend, {0:username}!
This would obviously simply be used in your code as <%= String.Format(phrase, username, points); %>
or similar
Which helped the translator a bit. But .NET String.FOrmat doesn't support commenting within the format string, unfortunately.
As you say, you would not want to handle that in your php with locale awareness or meta phrases.
So what we had was a master phrase table:
phraseid, english, supplemental info
and a localized table:
phraseid, localeid, translation
You've also assumed with INNER JOINS that the localized versions exist - we tended to leave them out until they were translated, so that query of yours would end up returning nothing at first (not even the default)
If a translation didn't exist, ours defaulted to English, then fellback to code-provided (in case the database didn't have the ID, and it was also clear from the code what phrase identifier "TXT_LNG_WRNNG_INV_LOW" was actually trying to get) - so the equivalent of this query is what we used:
SELECT COALESCE(localized.translation, phrase.english, @fallback)
FROM DUAL
LEFT JOIN phrase
ON phrase.phraseid = @phraseid
LEFT JOIN localized
ON localized.phraseid = phrase.phraseid
AND localized.localeid = @localeid
Obviously, you might get all the things at one time using your page system.
We tended not to link things to the page because they were re-used a lot between pages (and not just in page-fragments or controls), but that's certainly fine.
In the case of our Windows-native apps, we used reflection and a mapping file from control to translation tag so that translation did not require re-compiles (in pre-.NET apps we had to tag the controls using the Tag or other special properties). This is probably a little more problematic in PHP or ASP.NET MVC, but possible in ASP.NET where there is a full-blown server-side page model.
For testing, you can obviously query to find missing translations very easily. To find places which need to be tagged, translate the entire phrase dictionary using pig-latin or Klingon or something like replace every non-space character with ? - the English should stand out and let you know that some naked plaintext has crept into your HTML.
Best Answer
If I understand your requirements correctly....
I would just use the natural key, LanguageCode-CultureCode ("en-US," for example). It's small enough. (I'm using the entire "en-US" as the primary key to differentiate it from "en-GB," for example.)
This should allow you to get all articles in en or sv, and you can also query for en-US or sv-SE. Presumably, although it wasn't in your example, you could also query for en-CA, fr-CA, en, fr, or CA.
Edit--I'm sorry, you're right, no search by Culture in my old code. Revamped above, sorry. Here's a sample of content:
Language:
Language_Text:
Searching by Culture (Canada):
Searching by Language (French):
Searching by LanguageCode (Swedish):