Mysql – Appropriate multi-language schema, or overkill

database-designMySQL

UPDATE 2: I've actually ended up using this, and it's great after a couple tweaks. Here's my post on its actual design, and in action: http://tim.hithlonde.com/2013/lemon-schema-works/

I am building a web app, and I want it to support multiple languages. This structure has two components:

  1. Connecting locale ('english','Deutch', etc) with terms, and having a rosetta stone connecting terms, and terms in specific language.
  2. Grouping terms by page. I don't want to say, SELECT term1,term2,etc through the 30+ terms I might need on a page. I want to ask by the page they're connected to.

Here is my proposed table structure (note all the id's have relationships/indexes among them to make very efficient queries):

Schema diagram

  * locale
      * id
      * value //English, Deutch, etc//
  * terms
    * id
    * value //In English//
  * page 
    * id
    * value //Think add entry, menu//
  * page_group //group all terms to a page, for easy pulling//
    * id
    * page.id
    * term.id
  * rosetta
    * id
    * locale.id
    * term.id
    * value //french word for amount, description, etc//

This will allow queries like:

SELECT localization.value,
        terms.value
FROM localization
INNER JOIN terms ON terms.id=localization.termid
INNER JOIN page_group ON page_group.termid=localization.termid
INNER JOIN page ON page.id=page_group.pageid
INNER JOIN locale ON locale.id=localization.localeid
WHERE page.value='add_entry' AND locale.id=custlangid
ORDER BY terms.id

I only have to ask for two items; the language id that I need, and the page I need. It'll serve up all the terms, in the specified language, that are a part of the group of terms for that page.

I think this is a really good structure, but I would love some feedback.

UPDATE: To clarify, we are just talking about localization of the UI components. (labels, navigation, helpful text) All the info the user enters will be stored in unicode, not in this schema.

UPDATE 2: I've actually ended up using this, and it's great. Here's my post on it's actual design, and in action: http://tim.hithlonde.com/2013/lemon-schema-works/

Best Answer

We've done a lot of this, and (administrative) users were allowed to fix the translations live. (You still might want a caching layer, but I'm totally down with driving this with a real database and not resource files - it gives you a ton of power to query and find things which need to be translated, etc). I think your schema is probably fine, so I'll just pass on some stuff we learned in the hope that it's useful.

One thing you have left out is phrases with insertion points. In the example below, the order is reversed and the language is still English, but this could very easily be two different languages - pretend this is just two languages who normally put things in a different order.

Hello, <username> you have <x> points!

You've got <x> points to spend, <username>!

In our pre-.NET, we had a routine which did the insertion so the phrases would look like this:

Hello, {0:username} you have {1:points} points!

You've got {1:points} points to spend, {0:username}!

This would obviously simply be used in your code as <%= String.Format(phrase, username, points); %> or similar

Which helped the translator a bit. But .NET String.FOrmat doesn't support commenting within the format string, unfortunately.

As you say, you would not want to handle that in your php with locale awareness or meta phrases.

So what we had was a master phrase table:

phraseid, english, supplemental info

and a localized table:

phraseid, localeid, translation

You've also assumed with INNER JOINS that the localized versions exist - we tended to leave them out until they were translated, so that query of yours would end up returning nothing at first (not even the default)

If a translation didn't exist, ours defaulted to English, then fellback to code-provided (in case the database didn't have the ID, and it was also clear from the code what phrase identifier "TXT_LNG_WRNNG_INV_LOW" was actually trying to get) - so the equivalent of this query is what we used:

SELECT COALESCE(localized.translation, phrase.english, @fallback)
FROM DUAL
LEFT JOIN phrase
    ON phrase.phraseid = @phraseid
LEFT JOIN localized
    ON localized.phraseid = phrase.phraseid
    AND localized.localeid = @localeid

Obviously, you might get all the things at one time using your page system.

We tended not to link things to the page because they were re-used a lot between pages (and not just in page-fragments or controls), but that's certainly fine.

In the case of our Windows-native apps, we used reflection and a mapping file from control to translation tag so that translation did not require re-compiles (in pre-.NET apps we had to tag the controls using the Tag or other special properties). This is probably a little more problematic in PHP or ASP.NET MVC, but possible in ASP.NET where there is a full-blown server-side page model.

For testing, you can obviously query to find missing translations very easily. To find places which need to be tagged, translate the entire phrase dictionary using pig-latin or Klingon or something like replace every non-space character with ? - the English should stand out and let you know that some naked plaintext has crept into your HTML.