PostgreSQL – Full-text search over multiple CPU processes

full-text-search, performance, postgresql, postgresql-performance

Is it possible to configure Postgres to split a full-text search across multiple CPU processes so that it completes more quickly?

I'm running a full-text search over 2 million records against a GIN-indexed tsvector column, where the source text is about 10,000 characters long.
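For reference, the setup looks roughly like this; the table and column names are illustrative, not my actual schema:

    -- Hypothetical table: ~2M rows, ~10,000 characters of source text each
    CREATE TABLE documents (
        id   bigint PRIMARY KEY,
        body text,
        tsv  tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
    );

    CREATE INDEX documents_tsv_idx ON documents USING gin (tsv);

    -- The kind of search I'm running
    SELECT id
    FROM documents
    WHERE tsv @@ to_tsquery('english', 'example & query');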

I have far more CPU available than the search is using, so I feel like splitting the search across 4 processes, in batches of 500k records each, would let it run concurrently and therefore complete faster.

I'd be interested to know if anyone has tried this or implemented their own equivalent programmatically in SQL, along the lines of the sketch below.
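For example, I pictured opening four connections from the application and running something like this in each, one per ID range (using the hypothetical documents table above):

    -- Batch 1 of 4; the other connections would cover 500001-1000000, and so on
    SELECT id
    FROM documents
    WHERE id BETWEEN 1 AND 500000
      AND tsv @@ to_tsquery('english', 'example & query');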

Best Answer

The default setting of "max_parallel_workers_per_gather" is 2, which won't spread work over all 4 CPUs for any one query. But that doesn't matter if you aren't getting parallel plans in the first place.
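You can check the setting, raise it, and see whether the planner produces a parallel plan at all; for example, using the hypothetical documents table from the question:

    SHOW max_parallel_workers_per_gather;      -- 2 by default

    SET max_parallel_workers_per_gather = 4;   -- allow up to 4 workers per Gather node

    -- If the plan is parallel, you'll see a Gather node with "Workers Planned"
    EXPLAIN
    SELECT id
    FROM documents
    WHERE tsv @@ to_tsquery('english', 'example & query');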

Parallel query is a relatively new feature in PostgreSQL, and it is still being improved. You should run the newest version you can to give yourself the best chance of benefiting from it.

I believe the index scan itself cannot be parallelized (in any version). The table access can be, but it often doesn't make sense to parallelize it.
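To illustrate, when the table access is parallelized you would expect a plan of roughly this shape (schematic, not real EXPLAIN output): the Bitmap Index Scan on the GIN index runs in a single process, and only the heap visits are shared among workers.

    Gather
      Workers Planned: 2
      ->  Parallel Bitmap Heap Scan on documents
            Recheck Cond: (tsv @@ to_tsquery(...))
            ->  Bitmap Index Scan on documents_tsv_idx
                  Index Cond: (tsv @@ to_tsquery(...))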

If the indexed part of the query is highly selective and returns few rows, then "parallel_setup_cost" will exceed the benefit of parallelizing the table access for just a few rows.

On the other hand, if you return a lot of rows, then "parallel_tuple_cost" (multiplied by the number of rows returned) will exceed the benefit.

If you access a lot of rows but don't return them (as with count(*) or some other aggregate, or a filter the index is unable to address), that is the optimal case for parallelization to work well.
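A minimal sketch of that optimal case, again against the hypothetical documents table: many rows are visited and counted, but only one row comes back, so "parallel_tuple_cost" stays negligible while the heap work is shared.

    SET max_parallel_workers_per_gather = 4;

    -- An aggregate over many matching rows returns a single row,
    -- making this a good candidate for a parallel plan
    SELECT count(*)
    FROM documents
    WHERE tsv @@ to_tsquery('english', 'common & term');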