PostgreSQL – Index Trigger/Hook for Synonyms

full-text-search, gin-index, postgresql, synonyms, trigger

Is it possible, in PostgreSQL, to add a trigger to modify data prior to it going into a full text index?

PostgreSQL requires using files to add custom dictionaries for things like stop words, synonyms, etc. However, most managed database hosting does not allow file access. What would be best is if I could use a table (such as a spaced single or double column) instead of a file dictionary to supply the list of synonyms, stop words, etc., but after a lot of searching that does not appear to be possible.

Another path I looked into was populating a table with synonyms and then exporting it to the file system; however, PostgreSQL requires a full path to the file. On managed hosting you cannot provide a full path if you do not know the paths to begin with.

So now I am down to seeing whether there is some way to "hook" into the pre-index process and modify the data myself before it is added to an index. I thought about using a trigger to modify the data directly, but that would also change what is stored in the row's field, which I don't want. What I am looking for is a way to store the original data unchanged while modifying the text just before it is indexed. I thought about adding an extra column to hold the modified data, but that can get very messy and may require a lot of extra columns for larger data sets. Another option is materialized views or inherited/child tables but, once again, the result would be messy and difficult to manage.

TLDR: I am asking if there is any viable way to define stop words, synonyms, etc. without having to use files (so that it can be used on managed systems that do not provide file access)?

Best Answer

It's all preprocessing. PostgreSQL can do its thing. You can do yours.

CREATE INDEX ON myTable 
  USING gin ( myTransform(x) );

SELECT *
FROM myTable
WHERE 'foo & bar' @@ myTransform(x);

The transform itself is left as an exercise for the OP. Note that the function has to exist before the index is created:

CREATE FUNCTION myTransform ( x text )
RETURNS tsvector AS $$
  BEGIN
    -- modify x here (e.g. replace synonyms, drop stop words)
    RETURN x::tsvector;
  END;
$$ LANGUAGE plpgsql
IMMUTABLE;  -- required for use in an index expression
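
For the table-driven synonyms the question asks about, the "modify x" step could look something like the sketch below. The synonyms table, its column names, the sample rows, and the choice of the 'simple' text search configuration are all assumptions made for illustration; they are not from the question. One caveat: the function is declared IMMUTABLE so that it can be used in an index, even though it reads a table, so any index built on it would have to be rebuilt (REINDEX) after the synonyms change.

```sql
-- Hypothetical lookup table: one row per word that should be rewritten.
CREATE TABLE synonyms (
  word      text PRIMARY KEY,
  canonical text NOT NULL
);

INSERT INTO synonyms (word, canonical) VALUES
  ('pg',       'postgresql'),
  ('postgres', 'postgresql');

-- Split the input on whitespace, swap each word for its canonical form
-- when one is listed, and build the tsvector from the result.
-- Declared IMMUTABLE although it reads a table: REINDEX after editing synonyms.
CREATE OR REPLACE FUNCTION myTransform ( x text )
RETURNS tsvector AS $$
  BEGIN
    RETURN (
      SELECT to_tsvector(
               'simple',
               coalesce(string_agg(coalesce(s.canonical, w.word), ' ' ORDER BY w.ord), '')
             )
      FROM regexp_split_to_table(lower(x), '\s+') WITH ORDINALITY AS w(word, ord)
      LEFT JOIN synonyms s ON s.word = w.word
    );
  END;
$$ LANGUAGE plpgsql
IMMUTABLE;

SELECT myTransform('Postgres and PG');
-- 'and':2 'postgresql':1,3
```

The stored column is never touched; only the value handed to the index (and to the matching WHERE clause) is rewritten. Stop words could probably be handled the same way, for example by filtering rows out before the string_agg.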

Related Solutions

PostgreSQL pl/perl trigger, differentiate null vs empty

You did not specify PostgreSQL version. On 9.0.5 I have observed the same behavior (not sure if it's a bug, see comments below).

This is easy to work around: first test for definedness to catch NULL and, if that passes, test for an empty string.

if ( not defined $_TD->{ new }{ $column } ) {
    elog( NOTICE, "New value in column $column is not defined" );
}
elsif ( $_TD->{ new }{ $column } eq '' ) {
    elog( NOTICE, "New value in column $column is empty" );
}
if ( not defined $_TD->{ old }{ $column } ) {
    elog( NOTICE, "Old value in column $column is not defined" );
}
elsif ( $_TD->{ old }{ $column } eq '' ) {
    elog( NOTICE, "Old value in column $column is empty" );
}

Oracle Text CTXCAT Domain Index – Transactional and Before Insert/Update Trigger

Aha! I've found the answer. Talk about an edge case.

First, I found this post from 2007, where someone says:

...the AFTER trigger for synchronizing the CTXCAT index on [column] is not firing (since my update statement does not include the indexed column).

...

Oracle, would it not be better to generate the CTXCAT trigger to examine the :old and :new values in the indexed column, rather than using a check on (if updating([column]))?

Over a year later, someone replied:

I fixed this issue by adding additional UPDATING (first_name and last_name) conditions in the DR$table_name trigger.

Here is part of the modified trigger.

if (inserting or updating('LAST_NAME_FIRST') or updating('FIRST_NAME') or updating('LAST_NAME')) then reindex := TRUE;

Hm, so the CTXCAT index uses a trigger to know when it needs to update the index for a particular entry. All I need to do is tweak the trigger and recompile it so it does what I want.

To get the content of the trigger:

SELECT text
  FROM   user_source
  WHERE  name = 'DR$NAMES_IDXTC'
  AND    type = 'TRIGGER'
  ORDER  BY line;

I copied this into Sublime, prettified it, and found this (simplified):

TRIGGER "TEST"."DR$NAMES_IDXTC" AFTER
INSERT
OR
UPDATE ON "TEST"."NAMES"
FOR EACH ROW DECLARE REINDEX boolean := FALSE;

BEGIN 

IF (inserting
    OR updating('COMPOUND_NAME')
    OR :new."COMPOUND_NAME" <> :old."COMPOUND_NAME") THEN REINDEX := TRUE;
END IF;

...

END;

You can see that the 12c Oracle Text version's AFTER trigger does actually compare the :new and :old values of the indexed column to see if it needs to update--not the case back in 2008.

So...if I'm updating the :new value in my BEFORE trigger, that should be reflected in the AFTER trigger, and the comparison would kick off an update to the index. What gives?

Well, here are the two SQL statements I was using:

update test.names set 
  first_name = 'Skye', 
  last_name = 'Fillingim'
  where ... ;

update test.names set
  first_name = null,
  last_name = null
  where ... ;

(Slightly different than what I said in my question, I apologize.)

The effect here is that, each time I used one of these statements, either the :old.compound_name or :new.compound_name would be null. So when we get to this condition:

OR :new."COMPOUND_NAME" <> :old."COMPOUND_NAME"

We are doing an inequality comparison against a null, which evaluates to NULL (unknown) rather than TRUE. Hence, no index update.
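
A quick way to see this outside the trigger is a one-line query like the sketch below (Oracle syntax). The literal is just a stand-in for :new."COMPOUND_NAME" and the NULL for :old."COMPOUND_NAME"; the labels 'reindex' and 'no reindex' are made up for the illustration.

```sql
-- The comparison with NULL is never TRUE, so the ELSE branch is taken.
SELECT CASE
         WHEN 'Skye Fillingim' <> NULL THEN 'reindex'
         ELSE 'no reindex'
       END AS outcome
FROM   dual;
-- outcome: no reindex
```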

This is actually an extreme edge case, because you have to be indirectly modifying :new.column via a trigger, and either :new or :old must be null. I would never have discovered it if I hadn't used those exact SQL statements.

So, we have a slight update to the trigger:

IF (inserting
    OR updating('COMPOUND_NAME')
    OR :new."COMPOUND_NAME" <> :old."COMPOUND_NAME"
    OR (:new."COMPOUND_NAME" IS NULL AND :old."COMPOUND_NAME" IS NOT NULL)
    OR (:new."COMPOUND_NAME" IS NOT NULL AND :old."COMPOUND_NAME" IS NULL)
    ) THEN REINDEX := TRUE;
END IF;

And then just stick CREATE OR REPLACE in front of the old trigger, recompile it, and everything works perfectly.
