Postgresql – Why full text search on table with GIN index is still very slow

full-text-searchindexpostgresql

With what I've been able to gather thus far, if you need to run a full text search on a table in a PostgreSQL (psql 9.6.2, server 9.6.5) database with lots of entries (say on the order 1.2M+), the recommended approach is to create an index for that table (in this case we've created a GIN index) which should allow you to run a query like this:

SELECT * FROM speech WHERE speech_tsv @@ plainto_tsquery('a text string')

In addition to the results of this query sometimes not containing anything related search string, it often will take anywhere from 8-10 seconds.

The db is deployed on a fairly sizable, multicore EC2 instance, so I'm thinking, is it possible there's something else we could do to the db to help these queries run faster?

Or this the query execution time about reasonable given the huge amount of files and text we're asking it to search through (even via indexing)?

Here's what the table looks like:

                                         Table "public.speech"
        Column     |            Type             |                      Modifiers                      
    ---------------+-----------------------------+-----------------------------------------------------
     speech_id     | integer                     | not null default nextval('speech_id_seq'::regclass)
     speechtype_id | smallint                    | not null
     title         | character varying           | not null default ''::character varying
     speechdate    | date                        | default now()
     location      | character varying           | not null default ''::character varying
     source        | character varying           | not null default ''::character varying
     speechtext    | text                        | not null
     url           | character varying           | not null default ''::character varying
     release_id    | smallint                    | 
     created       | timestamp without time zone | 
     modified      | timestamp without time zone | 
     speech_tsv    | tsvector                    | 
     key           | boolean                     | 
     summary       | text                        | 
     quote         | text                        | 
    Indexes:
        "speech_pk" PRIMARY KEY, btree (speech_id)
        "speech__release_id" btree (release_id)
        "speech__speech_tsv" gin (speech_tsv)
        "speech__speechdate" btree (speechdate)
        "speech__speechtype_id" btree (speechtype_id)

Foreign-key constraints:
    "speech__release_id_fk" FOREIGN KEY (release_id) REFERENCES release(release_id) MATCH FULL ON DELETE RESTRICT DEFERRABLE INITIALLY DEFERRED
    "speech__speechtype_id_fk" FOREIGN KEY (speechtype_id) REFERENCES speechtype(speechtype_id) MATCH FULL DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "factcheck_speech" CONSTRAINT "factcheck_speech_speech_id_fkey" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_candidate" CONSTRAINT "speech_candidate__speech_id_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_category" CONSTRAINT "speech_category__speech_id_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_tag" CONSTRAINT "speech_tag__speech_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speechlocking" CONSTRAINT "speechlocking__fkey" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
Triggers:
    speech_updated BEFORE INSERT OR UPDATE ON speech FOR EACH ROW EXECUTE PROCEDURE pvs_speech_updated()
    update_speech_created BEFORE INSERT ON speech FOR EACH ROW EXECUTE PROCEDURE update_created_column()
    update_speech_modified BEFORE UPDATE ON speech FOR EACH ROW EXECUTE PROCEDURE update_modified_column()

(speechtext would be the column holding all the text to be searched, obviously)

Here is an example EXPLAIN (ANALYZE,BUFFERS) query executed directly on the server (although these queries are actually executed within a Python application, so it's running a bit quicker here with no network latency, etc):

                          QUERY PLAN                                                               
-------------------------------------------------------------------------
 Bitmap Heap Scan on speech  (cost=294.85..7931.12 rows=6142 width=1058) (actual time=400.623..67768.222 rows=27267 loops=1)
   Recheck Cond: (speech_tsv @@ plainto_tsquery('gun'::text))
   Heap Blocks: exact=23582
   Buffers: shared hit=2413 read=21424
   ->  Bitmap Index Scan on speech__speech_tsv  (cost=0.00..293.31 rows=6142 width=0) (actual time=279.709..279.709 rows=30535 loops=1)
         Index Cond: (speech_tsv @@ plainto_tsquery('gun'::text))
         Buffers: shared hit=241 read=14
 Planning time: 0.187 ms
 Execution time: 67778.684 ms
(9 rows)

Best Answer

If you look at the explain output, the actual index scan is not terribly slow at ~280ms. The slow part is fetching all the data you asked for in the second step.

You're doing a SELECT * here, you're asking for all the columns in that table. And from the explain output it looks like this is a pretty wide table with many or large columns. And your query is fetching ~27000 of those large rows.

The "read" and "hit" part of the Buffers line tells you that 21424 blocks had to be read from your hard drive or SSD, they were not cached in RAM. When you're reading a lot of data from disk, it will take some time.

Another factor is that you're transferring all that data you asked for to the client, which also takes time.

You're asking the database for a lot of data, and I suspect that you don't need all that data. So you should be more specific in your query, and only query the columns you actually need, and add a LIMIT clause unless you actually want to fetch all 27267 rows.