PostgreSQL (big data): search by tags and order by timestamp

Tags: index, postgresql, postgresql-11, postgresql-performance

We need to add a search feature on a large table (200M+ rows):

item_id | tags                          | created_at          | ...
-------------------------------------------------------------------
1       | ['tag1', 'bar2']              | 2020-01-06 12:43:32 |
2       | ['example5', 'tag9', 'foo2']  | 2020-01-10 10:40:00 |
3       | ['test1', 'tag5']             | 2020-01-11 12:43:32 |
...
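For concreteness, a minimal schema matching the sample above might look like this (column types are assumptions inferred from the sample data and the `::varchar[]` cast in the query below):

```sql
-- Illustrative schema; types are inferred, not from the original post
CREATE TABLE items (
    item_id    bigint PRIMARY KEY,
    tags       varchar[],               -- array of tag strings
    created_at timestamp NOT NULL
);
```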

The queries would be similar to this one:

SELECT * FROM items 
WHERE tags @> ARRAY['t2', 't5']::varchar[]
ORDER BY created_at DESC
LIMIT 100;

Basically, it's like searching logs by tags and ordering them by timestamp. It seems like a common scenario.

What index should we use? Have you ever tested something similar in production?

  • Option 1: create a GIN index on tags. The problem is that the search may return millions of rows, and to apply the ORDER BY / LIMIT the database has to read created_at from the table on disk for every one of those rows, i.e. millions of heap reads.
  • Option 2: install the btree_gin extension and create a composite GIN index on created_at and tags. The problem is the same as above: I believe PostgreSQL cannot use the index for ordering, because it is a GIN index rather than a B-tree, and GIN does not return entries in sorted order.
  • Option 3: create a B-tree index on created_at and tags. PostgreSQL would need to scan the whole index, since B-tree does not support the array containment operator @>. I also fear that, because of the SELECT *, PostgreSQL will not use an index-only scan, again resulting in millions of heap reads (most of them useless, since only 100 rows are actually needed).
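For reference, the three candidate indexes described above could be declared like this (index names are illustrative; option 2 requires the btree_gin extension):

```sql
-- Option 1: GIN index on the array column
CREATE INDEX items_tags_gin ON items USING gin (tags);

-- Option 2: composite GIN index via btree_gin
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX items_created_tags_gin ON items USING gin (created_at, tags);

-- Option 3: plain composite B-tree index
CREATE INDEX items_created_tags_btree ON items (created_at, tags);
```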

Best Answer

There are two approaches:

  1. create an index on the array:

    CREATE INDEX ON items USING gin (tags);
    

    That allows the database to find the matching rows quickly, but it then has to fetch all of them and perform a top-N sort to keep the 100 newest.

  2. create a B-tree index on created_at:

    CREATE INDEX ON items (created_at);
    

    That will allow the database to avoid the sort by scanning the index in descending created_at order, but it has to read and discard every row that doesn't match the tag condition until it has found 100 matches.

Unfortunately, the two strategies are mutually exclusive, and which one is faster depends on the data: if the tag combination is rare, the GIN index tends to win; if it matches a large fraction of the table, the B-tree index does. You'll have to experiment.
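One way to run that experiment is to build both indexes on a copy of the data and compare the actual plans and buffer usage for a representative query, for example:

```sql
-- Shows which index the planner picks, actual row counts, and I/O
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM items
WHERE tags @> ARRAY['t2', 't5']::varchar[]
ORDER BY created_at DESC
LIMIT 100;
```

Running this with both rare and common tag combinations should reveal where the crossover point lies for your data.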