PostgreSQL – Postgres 12 and full text search against arbitrary key-value pairs

Tags: amazon-rds, full-text-search, postgresql

I was hoping to get some advice or guidance around a problem I'm having.

We currently store event data in Postgres 12 (AWS RDS) – this data could contain anything. To reduce the amount of data (a lot of keys, for example, are common across all events) we flatten it and store it across 3 tables –

event_keys – the key names from events
event_values – the values from events
event_key_values – a lookup table, containing the event_id, key_id and value_id.

We first insert the key and value (or return the existing ids), and finally store the ids in the event_key_values table (a SQL sketch of this flow follows the example tables below). So two simple events such as

[
  {
    "event_id": 1,
    "name": "updates",
    "meta": {
      "id": 1,
      "value: "some random value"
    }
  },
  {
    "event_id": 2,
    "meta": {
      "id": 2,
      "value: "some random value"
    }
  }
]

would become

event_keys

id  key
1   name
2   meta.id
3   meta.value

event_values

id  value
1   updates
2   1
3   some random value
4   2

event_key_values

event_id  key_id  value_id
1         1       1
1         2       2
1         3       3
2         2       4
2         3       3
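
A minimal sketch of that get-or-create flow, assuming UNIQUE constraints on the key and value text columns (the question does not show the actual DDL):

-- Get-or-create the key id. The DO UPDATE is a no-op write that makes
-- RETURNING yield the id even when the row already exists.
INSERT INTO event_keys (key)
VALUES ('meta.id')
ON CONFLICT (key) DO UPDATE SET key = EXCLUDED.key
RETURNING id;

-- The same pattern returns a value id from event_values, after which
-- the pair of ids is recorded against the event:
INSERT INTO event_key_values (event_id, key_id, value_id)
VALUES (1, 2, 2);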

All values are converted to text before storing, and a GIN index has been added to the event_keys and event_values tables.
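
For reference, a trigram-based GIN index is one common way to support contains and regex searches on text; a sketch of what that DDL might look like (the question does not state which operator class was actually used):

-- Requires the pg_trgm extension; gin_trgm_ops makes the index usable
-- for LIKE/ILIKE and regular-expression matching.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX event_values_value_trgm_idx
    ON event_values USING gin (value gin_trgm_ops);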

When attempting to search this data, we are able to retrieve results; however, once we start hitting 1 million or more rows (we are expecting billions!) this can take anywhere from 10 seconds to minutes to find data. The key-values could have multiple search operations applied to them – equality, contains (case-sensitive and case-insensitive) and regex. To complicate things a bit more, the user can also search against all events, or a filtered selection (so only search against the last 10 days, events belonging to a certain application, etc.).

Some things I have noticed from testing:

  • searching with multiple WHERE conditions on the same key, e.g. meta.id, uses the GIN index. However, a WHERE condition spanning multiple keys does not hit the index (see the sketch after this list).
  • searching with multiple WHERE conditions against both the event_keys and event_values tables does not hit the GIN index.
  • using 'raw' SQL – we use jOOQ in this project and this was to rule out any issues caused by its SQL generation.
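
To illustrate the multi-key case from the first bullet, a search across two different keys has to join the lookup table against itself; a hypothetical query (aliases and search terms made up):

-- Find events where meta.id = '1' AND meta.value contains 'random'.
-- Each key/value pair needs its own pass over the lookup table.
SELECT ekv1.event_id
  FROM event_key_values ekv1
  JOIN event_keys k1   ON k1.id = ekv1.key_id   AND k1.key = 'meta.id'
  JOIN event_values v1 ON v1.id = ekv1.value_id AND v1.value = '1'
  JOIN event_key_values ekv2 ON ekv2.event_id = ekv1.event_id
  JOIN event_keys k2   ON k2.id = ekv2.key_id   AND k2.key = 'meta.value'
  JOIN event_values v2 ON v2.id = ekv2.value_id AND v2.value ILIKE '%random%';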

I have tried a few things:

  • denormalising the data and storing everything in one table – however this resulted in the database (200 GB disk) filling up within a few hours, with the index taking up more space than the data.
  • storing the key-values as a JSONB value against an event_id, the JSON blob containing the flattened key-value pairs as a map (sketched after this list) – this had the same issues as above, with the index taking up 1.5 times the space of the data.
  • building a document from the available key-values via concatenation, using both a sub-query and a CTE – from testing with a few million rows this takes forever, even when attempting to tune parameters such as work_mem!
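
The JSONB variant from the second bullet would look roughly like this (a sketch, not the exact schema that was tested):

-- One jsonb map of flattened key-value pairs per event.
CREATE TABLE event_documents (
    event_id bigint PRIMARY KEY,
    doc      jsonb NOT NULL
);
CREATE INDEX event_documents_doc_idx
    ON event_documents USING gin (doc jsonb_path_ops);

-- jsonb_path_ops only accelerates containment (@>), i.e. equality:
SELECT event_id FROM event_documents WHERE doc @> '{"meta.id": "1"}';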

From reading solutions and examples here, it seems full text search provides the most benefit and performance when applied against known columns, e.g. a table with first_name and last_name and a GIN index across those two columns – but I hope I am wrong. I don't believe the JOINs across tables are an issue, nor that event_values being pushed into TOAST storage due to its size is an issue (I have tried with truncated test values, all of the same length, 128 chars, and the results still take 60+ seconds).

From running EXPLAIN ANALYSE it appears that no matter how I tweak the queries or tables, most of the time is spent scanning the tables sequentially.
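
For anyone reproducing this, a plan of this shape can be inspected with, for example (the query here is a stand-in, not one of the actual production queries):

-- BUFFERS shows I/O alongside timings; a Seq Scan node over
-- event_values in the output is the symptom described above.
EXPLAIN (ANALYZE, BUFFERS)
SELECT v.id
  FROM event_values v
 WHERE v.value ILIKE '%random%';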

Am I simply spending time trying to make Postgres and full text search fit a problem they may never work for (or at least never perform acceptably on)? Or should I look at other solutions? One possible advantage of the data is that it is 'immutable' and never updated once persisted, so syncing it to something like Elasticsearch and running search queries against that first might be a solution.

I would really like to use Postgres for this as I've seen it is possible, and read several articles where fantastic performance has been achieved – but maybe this data just isn't suited?

Edit:

Due to the size of the values (some of these could be large JSON blobs, several hundred KB), the GIN index on event_values is based on the MD5 hash – the index is used for equality checks but, as expected, never for pattern searches. For event_keys the GIN index is against the key_name column. Users can search against key names, values or both, for example "List all event keys beginning with 'meta.hashes'"
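
That behaviour follows from how expression indexes work: a hash-based index can only serve queries written against the same hash expression. A minimal sketch (a btree expression index is shown here for illustration; the question's setup uses GIN):

-- Hypothetical equality-only index over the hash of the value.
CREATE INDEX event_values_md5_idx ON event_values (md5(value));

-- Only a query that compares the same expression can use it:
SELECT id FROM event_values WHERE md5(value) = md5('some random value');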

Best Answer

You may be setting up indexes wrong. Indexes need to be tailored to the operations you run against them. One index may be great for one query but useless for another. A GIN index over an MD5 hash (or any other hash) is only good for equality checks. If you wish to search in multiple ways you may need multiple indexes.
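
For example (a sketch against the question's schema), equality can be answered through the hash, while a pattern match needs a different index entirely:

-- Served by an index over md5(value):
SELECT id FROM event_values WHERE md5(value) = md5('exact match');

-- Needs a trigram (or similar) index instead; the hash index is useless here:
SELECT id FROM event_values WHERE value ~ '^some.*value$';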

Also, having too much flexibility makes things a lot harder. You will benefit from restricting and structuring the data. The more you know about what data you will have and what you will do with it, the more you can optimize your database. If "this data could contain anything" is all you have then you will never get great performance, no matter what database you use or how you set it up.

For starters, if some keys exist on all or most events you should map each such key to a column in an events table.

CREATE TABLE events (
    event_id         serial PRIMARY KEY,
    application_name text, -- This "key" is on most events.
    created_date     timestamptz NOT NULL, -- This "key" is on all events.
    ...
);

This has two benefits: (1) the key does not need to be joined or stored over and over, and (2) the value can be stored and indexed much more efficiently.

Now a query like "events from the web app in the last 10 days" is very easy to index and optimize.

SELECT *
  FROM events
 WHERE application_name = 'web-app'
   AND created_date > now() - interval '10 days';
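
A composite index along these lines (name assumed) lets that query avoid a sequential scan entirely, with the equality column first and the range column second:

CREATE INDEX events_app_date_idx ON events (application_name, created_date);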