My goal is to make this as convenient as possible (i.e. least external tables lookups in application) and efficient (least redundant indexes)
I need to model data that, without any optimization, would ideally look like:
.______________________________.
| en-US | pt-BR | zn-CN | ...
|===========|==========|=======|
|university | academia | 大学 |
| self-made | NULL | 自制 |
| NULL | saudade | 思 |
'-----------'----------'-------'
the main application use case is to fetch words from one locale and find the other locale.
The caveats are:
- I do not know the number of locales. Ideally they can be created by the application on runtime, but i am fine making a new locale be a Ops event that require downtime and alter table.
- There is no "main" locale.
- If we extract terms into a
locale, word
table, there is the problem ofA,B == B,A
, but we can work around this by having a logic on normalizing B,A as A,B to avoid N^N relations. But would be nice if the schema does not have this problem at all. - bonus: i'd like to avoid repeating the words/indexes all over, as is done with gnu gettext data files.
ideally queries would be as convenient as select en-US from words where zn-CN='nihao'
was playing with some alternatives on db<>fiddle](https://dbfiddle.uk/?rdbms=postgres_12&fiddle=b22dddd5e7b70cdf9f61500ef076cda9), but none are even close to optimal.
create table A ( locale text NOT NULL, word text NOT NULL, primary key (locale, word) );
✓
insert into A values ('en-US', 'hi'), ('zn-CN', 'nihao'), ('pt-BR', 'oi'); select * from A limit 5;
3 rows affected locale | word :----- | :---- en-US | hi zn-CN | nihao pt-BR | oi
create table B ( locale text NOT NULL, word text NOT NULL, locale2 text NOT NULL, word2 text NOT NULL, PRIMARY KEY(locale, word, locale2, word2), FOREIGN KEY(locale,word) REFERENCES A(locale,word), FOREIGN KEY(locale2,word2) REFERENCES A(locale,word) )
✓
table B would be noisy, but would allow queries close to the ideal. But i am afraid to even start looking if the engine indexing would be optimal with just those keys.
-- only 3 words is already a mess without application normalization logic insert into B values ('en-US', 'hi', 'pt-BR', 'oi'), ('pt-BR', 'oi', 'en-US', 'hi'), ('en-US', 'hi', 'zn-CN', 'nihao'), ('pt-BR', 'oi', 'zn-CN', 'nihao'), ('zn-CN', 'nihao', 'en-US', 'hi'), ('zn-CN', 'nihao', 'pt-BR', 'oi'); select * from B limit 10;
6 rows affected locale | word | locale2 | word2 :----- | :---- | :------ | :---- en-US | hi | pt-BR | oi pt-BR | oi | en-US | hi en-US | hi | zn-CN | nihao pt-BR | oi | zn-CN | nihao zn-CN | nihao | en-US | hi zn-CN | nihao | pt-BR | oi
create table C ( locale text NOT NULL, word text NOT NULL, locale2 text NOT NULL, word2 text NOT NULL, FOREIGN KEY(locale,word) REFERENCES A(locale,word), FOREIGN KEY(locale2,word2) REFERENCES A(locale,word) );
✓
- we can probably normalize somehow by selecting precedence for locales, e.g. en-US < pt-BR < zn-CN insert into C values ('en-US', 'hi', 'pt-BR', 'oi'), ('en-US', 'hi', 'zn-CN', 'nihao'), ('pt-BR', 'oi', 'zn-CN', 'nihao');
3 rows affected
select word2 as translation from C where locale='en-US' and word='hi' and locale2='zn-CN';
| translation | | :---------- | | nihao |
-- application must search on other side if locale normalization says so select word as translation from C where locale2='zn-CN' and word2='nihao' and locale ='en-US';
| translation | | :---------- | | hi |
…and I didn't even started to look if the engine will optimize the text keys properly when they are repeated all over the relation tables.
the naive indexed schema
create table A (
id SERIAL PRIMARY KEY,
locale text NOT NULL,
word text NOT NULL,
UNIQUE (locale, word)
);
insert into A (locale, word) values
('en-US', 'hi'),
('zn-CN', 'nihao'),
('pt-BR', 'oi');
select * from A limit 5;
id locale word 1 en-US hi 2 zn-CN nihao 3 pt-BR oi
create table B (
id serial PRIMARY KEY,
word1 serial references A(id),
word2 serial references A(id),
UNIQUE (word1,word2),
UNIQUE (word2, word1) )
insert into B (word1, word2) values
(1, 2),
(1, 3);
select * from B;
id word1 word2 1 1 2 2 1 3
But the queries are not too convenient
select word from A where locale='en-US' and id in ( select word1 from B where word2 = ( select id from A where locale='zn-CN' and word='nihao') UNION select word2 from B where word1 = ( select id from A where locale='zn-CN' and word='nihao') );
index mises everywhere
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=1e83c68b2f90f883340b1d24bb1e2a36
Best Answer
id -- locale_id -- word
is good enough.INT UNSIGNED id
should be used as PK for performance as far as string search for join is slow.id1 -- id2
where each pair is unique(CHECK id1 < id2)
CALL nest(A)
Then you can use something like that:
CALL convert_word('hello', 'en_US', 'zn_CN')
;