Postgresql – Best way to model a string matrix (postgreSQL)

database-designpostgresql

My goal is to make this as convenient as possible (i.e. least external tables lookups in application) and efficient (least redundant indexes)

I need to model data that, without any optimization, would ideally look like:

.______________________________.
|   en-US   |  pt-BR   | zn-CN | ...
|===========|==========|=======|
|university | academia |  大学 |
| self-made |  NULL    |  自制 |
|  NULL     | saudade  |  思   |
'-----------'----------'-------'

the main application use case is to fetch words from one locale and find the other locale.

The caveats are:

I do not know the number of locales. Ideally they can be created by the application on runtime, but i am fine making a new locale be a Ops event that require downtime and alter table.
There is no "main" locale.
If we extract terms into a locale, word table, there is the problem of A,B == B,A, but we can work around this by having a logic on normalizing B,A as A,B to avoid N^N relations. But would be nice if the schema does not have this problem at all.
bonus: i'd like to avoid repeating the words/indexes all over, as is done with gnu gettext data files.

ideally queries would be as convenient as select en-US from words where zn-CN='nihao'

was playing with some alternatives on db<>fiddle](https://dbfiddle.uk/?rdbms=postgres_12&fiddle=b22dddd5e7b70cdf9f61500ef076cda9), but none are even close to optimal.

create table A (
  locale text NOT NULL,
  word text NOT NULL,
  primary key (locale, word)
);

✓

insert into A values
('en-US', 'hi'),
('zn-CN', 'nihao'),
('pt-BR', 'oi');

select * from A limit 5;

3 rows affected

locale | word 
:----- | :----
en-US  | hi   
zn-CN  | nihao
pt-BR  | oi

create table B (
  locale text NOT NULL,
  word text NOT NULL,
  locale2 text NOT NULL,
  word2 text NOT NULL,
  PRIMARY KEY(locale, word, locale2, word2),
  FOREIGN KEY(locale,word) REFERENCES A(locale,word),
  FOREIGN KEY(locale2,word2) REFERENCES A(locale,word)
)

✓

table B would be noisy, but would allow queries close to the ideal. But i am afraid to even start looking if the engine indexing would be optimal with just those keys.

-- only 3 words is already a mess without application normalization logic
insert into B values
('en-US', 'hi', 'pt-BR', 'oi'),
('pt-BR', 'oi', 'en-US', 'hi'),
('en-US', 'hi', 'zn-CN', 'nihao'),
('pt-BR', 'oi', 'zn-CN', 'nihao'),
('zn-CN', 'nihao', 'en-US', 'hi'),
('zn-CN', 'nihao', 'pt-BR', 'oi');

select * from B limit 10;

6 rows affected

locale | word  | locale2 | word2
:----- | :---- | :------ | :----
en-US  | hi    | pt-BR   | oi   
pt-BR  | oi    | en-US   | hi   
en-US  | hi    | zn-CN   | nihao
pt-BR  | oi    | zn-CN   | nihao
zn-CN  | nihao | en-US   | hi   
zn-CN  | nihao | pt-BR   | oi

     create table C (
       locale text NOT NULL,
       word text NOT NULL,
       locale2 text NOT NULL,
       word2 text NOT NULL,
       FOREIGN KEY(locale,word) REFERENCES A(locale,word),
       FOREIGN KEY(locale2,word2) REFERENCES A(locale,word)
     );

✓

- we can probably normalize somehow by selecting precedence for locales, e.g. en-US < pt-BR < zn-CN
insert into C values
('en-US', 'hi', 'pt-BR', 'oi'),
('en-US', 'hi', 'zn-CN', 'nihao'),
('pt-BR', 'oi', 'zn-CN', 'nihao');

3 rows affected

select word2 as translation from C where locale='en-US' and word='hi' and locale2='zn-CN';

| translation |
| :---------- |
| nihao       |

-- application must search on other side if locale normalization says so
select word as translation from C where locale2='zn-CN' and word2='nihao' and locale ='en-US';

| translation |
| :---------- |
| hi          |

…and I didn't even started to look if the engine will optimize the text keys properly when they are repeated all over the relation tables.

the naive indexed schema

create table A (
  id SERIAL PRIMARY KEY,
  locale text NOT NULL,
  word text NOT NULL,
  UNIQUE  (locale, word)
);

insert into A (locale, word) values 
('en-US', 'hi'),
('zn-CN', 'nihao'),
('pt-BR', 'oi');
select * from A limit 5;

id    locale  word
1     en-US   hi
2     zn-CN   nihao
3     pt-BR   oi

create table B (
  id serial PRIMARY KEY,
  word1 serial references A(id),
  word2 serial references A(id),
  UNIQUE (word1,word2),
  UNIQUE (word2, word1) )

insert into B (word1, word2) values
(1, 2),
(1, 3);
select * from B;

id    word1   word2
1     1   2
2     1   3

But the queries are not too convenient

select word from A where locale='en-US' and id in (
   select word1 from B where word2 = (
      select id from A where locale='zn-CN' and word='nihao')
  UNION
   select word2 from B where word1 = (
      select id from A where locale='zn-CN' and word='nihao')
);

index mises everywhere

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=1e83c68b2f90f883340b1d24bb1e2a36

Best Answer

Plain list id -- locale_id -- word is good enough.
INT UNSIGNED id should be used as PK for performance as far as string search for join is slow.
Relation of equivalence can be represented by M:N table id1 -- id2 where each pair is unique
Commutative duplicates can be avoided by constraint (CHECK id1 < id2)
The whole nest for all locales can be fetched by single recursive CTE - even if there is no full relation graph declared. Say if A=B and B=C and C=D and there is no A=D, you'll get (A,B,C,D) as a result of CALL nest(A)

Then you can use something like that:
CALL convert_word('hello', 'en_US', 'zn_CN');

Related Solutions

Fact table foreign keys null

You can leave the FK to some dimension tables as NULL if those dimensions are not known or not applicable. You just have to remember to use outer joins when you do your reporting query.

Alternatively, some people create a "none" and/or an "n/a" dimension record for data mart dimensions and then populate fact table FKs to point at these rather than using NULLs. People who do this like this approach because they have an aversion to outer joins.

People who use NULL FKs in fact tables usually have an aversion to people who have an version to outer joins. ;) (in other words, this is a stylistic issue which can generate religious wars)

I say do whichever you prefer, but pick one approach and stick to it fervently.

Postgresql – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows, that would solve most of it, even if it means to execute the whole query every time.

If you go the route with a materialized view like I proposed you could add a serial number without gaps to the table plus index on it - by adding a column row_number() OVER (<your sort citeria here>) AS mv_id.

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete, that needs to compute the whole table before it can sort and pick 100 rows.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE or the built in Shift + F7keyboard shortcut or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms