Postgresql – Best way to model a string matrix (postgreSQL)

database-designpostgresql

My goal is to make this as convenient as possible (i.e. least external tables lookups in application) and efficient (least redundant indexes)

I need to model data that, without any optimization, would ideally look like:

.______________________________.
|   en-US   |  pt-BR   | zn-CN | ...
|===========|==========|=======|
|university | academia |  大学 |
| self-made |  NULL    |  自制 |
|  NULL     | saudade  |  思   |
'-----------'----------'-------'

the main application use case is to fetch words from one locale and find the other locale.

The caveats are:

  • I do not know the number of locales. Ideally they can be created by the application on runtime, but i am fine making a new locale be a Ops event that require downtime and alter table.
  • There is no "main" locale.
  • If we extract terms into a locale, word table, there is the problem of A,B == B,A, but we can work around this by having a logic on normalizing B,A as A,B to avoid N^N relations. But would be nice if the schema does not have this problem at all.
  • bonus: i'd like to avoid repeating the words/indexes all over, as is done with gnu gettext data files.

ideally queries would be as convenient as select en-US from words where zn-CN='nihao'

was playing with some alternatives on db<>fiddle](https://dbfiddle.uk/?rdbms=postgres_12&fiddle=b22dddd5e7b70cdf9f61500ef076cda9), but none are even close to optimal.

create table A (
  locale text NOT NULL,
  word text NOT NULL,
  primary key (locale, word)
);
insert into A values
('en-US', 'hi'),
('zn-CN', 'nihao'),
('pt-BR', 'oi');

select * from A limit 5;
3 rows affected

locale | word 
:----- | :----
en-US  | hi   
zn-CN  | nihao
pt-BR  | oi   
create table B (
  locale text NOT NULL,
  word text NOT NULL,
  locale2 text NOT NULL,
  word2 text NOT NULL,
  PRIMARY KEY(locale, word, locale2, word2),
  FOREIGN KEY(locale,word) REFERENCES A(locale,word),
  FOREIGN KEY(locale2,word2) REFERENCES A(locale,word)
)

table B would be noisy, but would allow queries close to the ideal. But i am afraid to even start looking if the engine indexing would be optimal with just those keys.

-- only 3 words is already a mess without application normalization logic
insert into B values
('en-US', 'hi', 'pt-BR', 'oi'),
('pt-BR', 'oi', 'en-US', 'hi'),
('en-US', 'hi', 'zn-CN', 'nihao'),
('pt-BR', 'oi', 'zn-CN', 'nihao'),
('zn-CN', 'nihao', 'en-US', 'hi'),
('zn-CN', 'nihao', 'pt-BR', 'oi');

select * from B limit 10;
6 rows affected

locale | word  | locale2 | word2
:----- | :---- | :------ | :----
en-US  | hi    | pt-BR   | oi   
pt-BR  | oi    | en-US   | hi   
en-US  | hi    | zn-CN   | nihao
pt-BR  | oi    | zn-CN   | nihao
zn-CN  | nihao | en-US   | hi   
zn-CN  | nihao | pt-BR   | oi   
     create table C (
       locale text NOT NULL,
       word text NOT NULL,
       locale2 text NOT NULL,
       word2 text NOT NULL,
       FOREIGN KEY(locale,word) REFERENCES A(locale,word),
       FOREIGN KEY(locale2,word2) REFERENCES A(locale,word)
     );
- we can probably normalize somehow by selecting precedence for locales, e.g. en-US < pt-BR < zn-CN
insert into C values
('en-US', 'hi', 'pt-BR', 'oi'),
('en-US', 'hi', 'zn-CN', 'nihao'),
('pt-BR', 'oi', 'zn-CN', 'nihao');
3 rows affected
select word2 as translation from C where locale='en-US' and word='hi' and locale2='zn-CN'; 
| translation |
| :---------- |
| nihao       |
-- application must search on other side if locale normalization says so
select word as translation from C where locale2='zn-CN' and word2='nihao' and locale ='en-US';
| translation |
| :---------- |
| hi          |

…and I didn't even started to look if the engine will optimize the text keys properly when they are repeated all over the relation tables.

the naive indexed schema

create table A (
  id SERIAL PRIMARY KEY,
  locale text NOT NULL,
  word text NOT NULL,
  UNIQUE  (locale, word)
);
insert into A (locale, word) values 
('en-US', 'hi'),
('zn-CN', 'nihao'),
('pt-BR', 'oi');
select * from A limit 5;
id    locale  word
1     en-US   hi
2     zn-CN   nihao
3     pt-BR   oi
create table B (
  id serial PRIMARY KEY,
  word1 serial references A(id),
  word2 serial references A(id),
  UNIQUE (word1,word2),
  UNIQUE (word2, word1) )

insert into B (word1, word2) values
(1, 2),
(1, 3);
select * from B;
id    word1   word2
1     1   2
2     1   3

But the queries are not too convenient

select word from A where locale='en-US' and id in (
   select word1 from B where word2 = (
      select id from A where locale='zn-CN' and word='nihao')
  UNION
   select word2 from B where word1 = (
      select id from A where locale='zn-CN' and word='nihao')
);

index mises everywhere

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=1e83c68b2f90f883340b1d24bb1e2a36

Best Answer

  • Plain list id -- locale_id -- word is good enough.
  • INT UNSIGNED id should be used as PK for performance as far as string search for join is slow.
  • Relation of equivalence can be represented by M:N table id1 -- id2 where each pair is unique
  • Commutative duplicates can be avoided by constraint (CHECK id1 < id2)
  • The whole nest for all locales can be fetched by single recursive CTE - even if there is no full relation graph declared. Say if A=B and B=C and C=D and there is no A=D, you'll get (A,B,C,D) as a result of CALL nest(A)

Then you can use something like that:
CALL convert_word('hello', 'en_US', 'zn_CN');