Postgresql – Why full text search on table with GIN index is still very slow

full-text-searchindexpostgresql

With what I've been able to gather thus far, if you need to run a full text search on a table in a PostgreSQL (psql 9.6.2, server 9.6.5) database with lots of entries (say on the order 1.2M+), the recommended approach is to create an index for that table (in this case we've created a GIN index) which should allow you to run a query like this:

SELECT * FROM speech WHERE speech_tsv @@ plainto_tsquery('a text string')

In addition to the results of this query sometimes not containing anything related search string, it often will take anywhere from 8-10 seconds.

The db is deployed on a fairly sizable, multicore EC2 instance, so I'm thinking, is it possible there's something else we could do to the db to help these queries run faster?

Or this the query execution time about reasonable given the huge amount of files and text we're asking it to search through (even via indexing)?

Here's what the table looks like:

                                         Table "public.speech"
        Column     |            Type             |                      Modifiers                      
    ---------------+-----------------------------+-----------------------------------------------------
     speech_id     | integer                     | not null default nextval('speech_id_seq'::regclass)
     speechtype_id | smallint                    | not null
     title         | character varying           | not null default ''::character varying
     speechdate    | date                        | default now()
     location      | character varying           | not null default ''::character varying
     source        | character varying           | not null default ''::character varying
     speechtext    | text                        | not null
     url           | character varying           | not null default ''::character varying
     release_id    | smallint                    | 
     created       | timestamp without time zone | 
     modified      | timestamp without time zone | 
     speech_tsv    | tsvector                    | 
     key           | boolean                     | 
     summary       | text                        | 
     quote         | text                        | 
    Indexes:
        "speech_pk" PRIMARY KEY, btree (speech_id)
        "speech__release_id" btree (release_id)
        "speech__speech_tsv" gin (speech_tsv)
        "speech__speechdate" btree (speechdate)
        "speech__speechtype_id" btree (speechtype_id)

Foreign-key constraints:
    "speech__release_id_fk" FOREIGN KEY (release_id) REFERENCES release(release_id) MATCH FULL ON DELETE RESTRICT DEFERRABLE INITIALLY DEFERRED
    "speech__speechtype_id_fk" FOREIGN KEY (speechtype_id) REFERENCES speechtype(speechtype_id) MATCH FULL DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "factcheck_speech" CONSTRAINT "factcheck_speech_speech_id_fkey" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_candidate" CONSTRAINT "speech_candidate__speech_id_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_category" CONSTRAINT "speech_category__speech_id_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_tag" CONSTRAINT "speech_tag__speech_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speechlocking" CONSTRAINT "speechlocking__fkey" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
Triggers:
    speech_updated BEFORE INSERT OR UPDATE ON speech FOR EACH ROW EXECUTE PROCEDURE pvs_speech_updated()
    update_speech_created BEFORE INSERT ON speech FOR EACH ROW EXECUTE PROCEDURE update_created_column()
    update_speech_modified BEFORE UPDATE ON speech FOR EACH ROW EXECUTE PROCEDURE update_modified_column()

(speechtext would be the column holding all the text to be searched, obviously)

Here is an example EXPLAIN (ANALYZE,BUFFERS) query executed directly on the server (although these queries are actually executed within a Python application, so it's running a bit quicker here with no network latency, etc):

                          QUERY PLAN                                                               
-------------------------------------------------------------------------
 Bitmap Heap Scan on speech  (cost=294.85..7931.12 rows=6142 width=1058) (actual time=400.623..67768.222 rows=27267 loops=1)
   Recheck Cond: (speech_tsv @@ plainto_tsquery('gun'::text))
   Heap Blocks: exact=23582
   Buffers: shared hit=2413 read=21424
   ->  Bitmap Index Scan on speech__speech_tsv  (cost=0.00..293.31 rows=6142 width=0) (actual time=279.709..279.709 rows=30535 loops=1)
         Index Cond: (speech_tsv @@ plainto_tsquery('gun'::text))
         Buffers: shared hit=241 read=14
 Planning time: 0.187 ms
 Execution time: 67778.684 ms
(9 rows)

Best Answer

If you look at the explain output, the actual index scan is not terribly slow at ~280ms. The slow part is fetching all the data you asked for in the second step.

You're doing a SELECT * here, you're asking for all the columns in that table. And from the explain output it looks like this is a pretty wide table with many or large columns. And your query is fetching ~27000 of those large rows.

The "read" and "hit" part of the Buffers line tells you that 21424 blocks had to be read from your hard drive or SSD, they were not cached in RAM. When you're reading a lot of data from disk, it will take some time.

Another factor is that you're transferring all that data you asked for to the client, which also takes time.

You're asking the database for a lot of data, and I suspect that you don't need all that data. So you should be more specific in your query, and only query the columns you actually need, and add a LIMIT clause unless you actually want to fetch all 27267 rows.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE or the built in Shift + F7keyboard shortcut or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms

Postgresql – Optimizing ORDER BY in a full text search query

What I still don't understand, is why this is slower.

That sorting the rows will cost something is obvious. But why so much?
Without ORDER BY rank0... Postgres can just

pick the first 5 rows it finds and stop fetching rows as soon as it has 5.

Bitmap Heap Scan on entities ... rows=5 ...
then compute ts_rank() for just 5 rows.

In the second case, Postgres has to

fetch all (1495 according to your query plan) rows that qualify.

Bitmap Heap Scan on entities ... rows=1495 ...
compute ts_rank() for all of them.
sort all of them to find the first 5 according to the calculated value.

Try ORDER BY name just to see the cost of computing to_tsquery('english', 'hockey'::text)) for the superfluous rows and how much remains for fetching more rows and sorting.

Best Answer

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

pgAdmin timing

Postgresql – Optimizing ORDER BY in a full text search query

Related Question