Postgresql – Infix search on large table when majority of rows can be filtered out

optimizationperformancepostgresqlpostgresql-performance

We have a Postgres table, which looks like this:

+-----------+---------+---------+---------+
| CLIENT_ID | VALUE_1 | VALUE_2 | VALUE_3 |
+-----------+---------+---------+---------+
|     11234 | aabcdef |   mnfng |  lflgbm |
+-----------+---------+---------+---------+
|     11234 |   xdfef | nfnvnaz | fngnnva |
+-----------+---------+---------+---------+
|     84590 |  pflghh |   otngp | cignral |
+-----------+---------+---------+---------+
|    ...    |   ...   |   ...   |   ...   |

CLIENT_ID is an indexed column.

We need to run a query which will return all the rows that belong to a specific CLIENT_ID and will have a particular substring in some of the columns VALUE_X.

Naive query for client id 11234 and substring ng would look like:

SELECT * FROM tbl WHERE client_id = '11234' AND (value_1 LIKE '%ng%' OR
value_2 LIKE '%ng%' OR value_3 LIKE '%ng%')

The problem is that our table is large. It can possibly contain 10+ millions of rows. Rows can be first filtered by CLIENT_ID – then the subset of rows matching will only have about 100k+ rows.

Our tests show that when the table is relatively small, the database does a sequential scan of the whole table. It's not very fast, but it's reasonably fast (about 1s). When the table gets larger, the database first filters the rows with a specific CLIENT_ID (which is indexed) and then does sequential scan of the rest. This is super slow (about 30s+).

Is there any way how we can speed this up, only by using Postgres (we don't want to introduce Elasticsearch etc.)?

We are currently using Postgres 10.9 in production, but we can possibly upgrade to the latest version, if there is significant performance gain (our staging tests show there are not).

Thank you.

Note that there are already threads on speeding up infix searches, but this case is more specific – we can possibly filter many of the rows out otherwise.

Best Answer

Thank you all for your input. My colleague just solved the problem by using Trigram index on the VALUE_N columns. Now the queries run under 10ms.