Postgresql – How to implement full-text search on a single value in a JSONB column

full-text-searchindexjsonpostgresql

I have a table that stores a conversation between two people.

The data will look something like this:

CREATE TABLE foo
AS
  SELECT $$[
    { "user": 1, "timestamp": 1, "message": "First message" },
    { "user": 2, "timestamp": 2, "message": "Second message" },
    { "user": 2, "timestamp": 3, "message": "Debounced message from same user" },
    { "user": 1, "timestamp": 4, "message": "Last message" }
  ]$$::jsonb AS jsondata;

I never need to look up each message individually, so I just want to store the whole conversation in a single jsonb field. I need to perform a full-text search across all of the messages.

My first thought was to create a new text column, concat all of the messages into one long string, and create a trigram GIN index on that column.

That seems like a hack that wastes a lot of space, so I would like to avoid the intermediate column. How can I create the index directly from the jsonb column?

Best Answer

The way I read this question, you only care about message. The difficulty here is that you need to,

  1. map over a json array returning the message element
  2. reduce/fold the array of message element strings to an aggregate string.

This is easy in functional programming. It's not as easy with the stock functions in PostgreSQL, and it'd be difficult to make it work with a declarative language. Maybe one day you'll have a jsonb_array_elements(jsonb [,path]) which will get you by but until then we can create a function in our database.

Creating a function with plpgsql

Note this probably isn't as a fast nor as clean as a plv8 function, but in the next revision we'll return a tsvector.

Here we use jsonb_array_elements to expand the json, and then aggregate back the 'message' elements into a string.

CREATE OR REPLACE FUNCTION jsonb_message_to_string( jsondata jsonb, out string text )
AS $func$
  BEGIN
    SELECT INTO string
      string_agg(d->>'message', ' ')
    FROM jsonb_array_elements(jsondata) AS d;
    RETURN;
  END;
$func$ LANGUAGE plpgsql
IMMUTABLE;

Creating tsvector_agg and improving our function.

This function is not yet optimal though because it's returning a string. However, there is a second difficulty in that as of 9.6 PostgreSQL does not yet ship with a tsvector_agg; but, it's PostgreSQL so we can make one..

CREATE AGGREGATE tsvector_agg (tsvector) (
  SFUNC = tsvector_concat,
  STYPE = tsvector
);

This permits us to now return an aggregate tsvector which is faster and retains positional information. Now we can improve our function. Here we create a new jsonb_message_to_tsvector.

CREATE OR REPLACE FUNCTION jsonb_message_to_tsvector( jsondata jsonb, out tsv tsvector )
AS $func$
  BEGIN
    SELECT INTO tsv
      tsvector_agg(to_tsvector(d->>'message'))
    FROM jsonb_array_elements(jsondata) AS d;
    RETURN;
  END;
$func$ LANGUAGE plpgsql
IMMUTABLE;

Now we can create our index..

CREATE INDEX ON FOO
  USING gin (jsonb_message_to_tsvector(jsondata));

And we would query it like so..

SELECT jsonb_message_to_tsvector(jsondata) @@ 'first'
FROM foo;