Postgresql – Indexing a string by a subset of its characters (i.e.: the Kanji ones)

indexpostgresqlvarchar

I have a database in PostgreSQL modelling a Japanese->English dictionary. The tables I am interested in are listed as follows, where bold represents PK and italic FK:

  • entries(entry_id int, keyword varchar, reading varchar, meaning varchar)
  • entries_kanji(kanji char, entry_id int) where entry_id references entries.entry_id

The idea is that entries.keyword is a string containing several kanji. For each of these kanji, there will be an appropriate row in entries_kanji linking back to the entry. This allows me to quickly find all the entries whose keywords contain specific kanji, and by computing the intersection of several queries on this table, identify a word by its kanji alone.

At present, I am using a script external to the database to generate the entries_kanji table by iterating over every row in entries, extracting the kanji from each keyword and inserting a row into entries_kanji. While this works, it is inconvenient to have to run this script every time a new entry is added to the dictionary.

My question is: is there a way to achieve the above using an index alone? I know PostgreSQL allows for indexes to be created based on expressions, but I don't know how to achieve what I described above using this. Or is there some other feature of PostgreSQL that will allow me to do away with the script?

Best Answer

There is a way of implementing what you want by using Regular Expressions and a couple of triggers.

To set the stage, I assume your tables are created this way:

CREATE TABLE entries
(
    entry_id int PRIMARY KEY, 
    keyword varchar, 
    reading varchar, 
    meaning varchar
) ;

CREATE TABLE  entries_kanji
(
    kanji nchar, 
    entry_id int REFERENCES entries (entry_id)
) ;

I assume that your definition of Kanji can be translated to a Regular Expression. To begin with, I use the definition of Kanji found on Localizing Japan. That is, we can define one funcion to check whether a character is a Kanji by doing this:

CREATE OR REPLACE FUNCTION is_kanji(IN a_char character) 
RETURNS boolean AS
$BODY$
    SELECT
        a_char ~ '[\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]' 
$BODY$
LANGUAGE sql 
IMMUTABLE 
STRICT;

The ~ operator means text_on_the_left matches the regular expression on the right. See Pattern Matching in PostgreSQL.

The result of executing the following query

SELECT 
    is_kanji('齢');

is true.

Next step is, given a sentence (a string where there are several Kanji chars), we SELECT all of them. This can be done with the following query:

SELECT
    unnest(regexp_matches('答 answer 電 electricity 教 teach', 
      '[\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]', 'gi')) AS kanji

The function regexp_matches returns an array containing all the parts of the passed string that match the regular expression. This array is converted into a set of rows by means of the unnest function.

Once we have this, we create one trigger function to be associated with INSERTS, UPDATES and DELETES into the entries table.

This function would take care:

CREATE FUNCTION trigger_entry() 
RETURNS trigger AS $$
BEGIN
    if TG_OP = 'DELETE' or TG_OP = 'UPDATE' then
        -- When we delete or update rows, we delete all 
        -- corresponding entries_kanji.
        DELETE FROM
            entries_kanji 
        WHERE
            entry_id = OLD.entry_id ;    
    end if ;
    if TG_OP = 'INSERT' or TG_OP = 'UPDATE' then
        -- When we insert or update rows, we insert all 
        -- corresponding entries_kanji
        INSERT INTO
            entries_kanji (kanji, entry_id)
        SELECT
             unnest(regexp_matches(NEW.keyword, 
                 '[\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]', 'g')) AS kanji,
             new.entry_id ;
        -- If case we use a 'before' trigger... we return NEW row (untouched)
        return new ;
    end if ;

    -- In case we use a 'before' trigger, ... we return OLD row (untouched).
    return old ;
END $$
LANGUAGE plpgsql 
VOLATILE ;

And we associate this trigger function with one trigger (I have chosen it to be an AFTER trigger, in case some other BEFORE triggers do some data manipulation; otherwise, it could also be a BEFORE trigger):

CREATE TRIGGER trigger_make_entries_kanji 
    AFTER INSERT OR UPDATE OF entry_id, keyword OR DELETE
    ON entries 
    FOR EACH ROW
    EXECUTE PROCEDURE trigger_entry();

With this setup, you can now insert data into entries, and get the kanji in entries_kanji:

INSERT INTO 
    entries (entry_id, keyword, reading, meaning)
VALUES
    (1, 'One sentence with one kanji(朗) symbol', 'reading_1', 'meaning_1'),
    (2, '答 answer 電 electricity 教 teach', 'reading_2', 'meaning_2');

and see that you get the proper information in the entries_kanji table:

SELECT
    *
FROM
    entries_kanji 
ORDER BY
    entry_id, kanji ;

You get

+-------+----------+ 
| kanji | entry_id |
+-------+----------+  
| 朗    |        1 |  
| 教    |        2 |  
| 答    |        2 |  
| 電    |        2 |  
+-------+----------+  

You can also check that the trigger takes care when you UPDATE or DELETE entries.

The advantages of triggers versus materialized views are that

  1. They change the information in real time, without need for refreshing
  2. They are available also in older versions of PostgreSQL (which might be important if version is out of your control)

The disadvantages are:

  1. You (or someone in your team) need to have some knowledge on how to program triggers. Triggers can have some nasty side-effects if not used properly.
  2. You carry some extra cost when inserting, updating or deleting your table's data.

Depending on your use case, you can decide which solution is more appropriate.