PostgreSQL Performance – Simple Select Query Taking Too Long

execution-plan postgresql select

I have been having a hard time with this query. I have a web page that is supposed to run it a dozen times, and with an execution time of up to 40 to 50 seconds per query, you can see how this becomes quite problematic as the database keeps growing.

Query:

select distinct(scanid) 
from misspelled 
where word = ?

Knowing that the misspelled table can grow to a dozen million rows, is there any way to make this faster?

Edit: thanks, here are more details.

Explain Analyze:

HashAggregate  (cost=76018.68..76018.79 rows=11 width=8) (actual time=38359.639..38359.823 rows=1071 loops=1)
  Group Key: scanid
  ->  Bitmap Heap Scan on misspelled  (cost=1014.12..75918.78 rows=39960 width=8) (actual time=49.946..38319.196 rows=37993 loops=1)
        Recheck Cond: (word = 'Facebook*'::text)
        Heap Blocks: exact=28920
        ->  Bitmap Index Scan on test_scanid  (cost=0.00..1004.13 rows=39960 width=0) (actual time=40.541..40.541 rows=37993 loops=1)
              Index Cond: (word = 'Facebook*'::text)
Planning time: 197.561 ms
Execution time: 38360.440 ms

Also available at: http://explain.depesz.com/s/7ttn

Table definition:

CREATE TABLE misspelled
(
  word text,
  url text,
  occnum bigint,
  proper boolean,
  scanid bigint,
  CONSTRAINT misspelled_scanid_fkey FOREIGN KEY (scanid)
      REFERENCES scans (scanid) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT misspelled_word_url_scanid_key UNIQUE (word, url, scanid)
);

CREATE INDEX test_scanid
  ON misspelled
  USING btree
  (word COLLATE pg_catalog."default", scanid);

http://pastebin.com/D6jrT2NC

PostgreSQL version 9.4.0

If you need anything else, please let me know.

Thanks

Best Answer

You have almost 40,000 rows where word = 'Facebook*'. I suppose that these rows are scattered more or less randomly across the whole table. This means Postgres has to read tens of thousands of blocks from disk in random order, which can be slow.
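
A quick way to get a feel for how scattered the rows are is to look at the planner's statistics. This is just a rough check: pg_stats holds per-column statistics, and a correlation near 0 means the physical row order is unrelated to the column's sort order, i.e. mostly random I/O for a range of equal values.

select attname, correlation
from pg_stats
where tablename = 'misspelled' and attname = 'word';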

You can try to cluster rows that have the same word together, using the CLUSTER command with your existing test_scanid index. This rewrites the whole table, so it will take some time and block access to the table until it is done; plan accordingly.
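
A sketch of the maintenance step (the ANALYZE afterwards is my addition; it refreshes the planner statistics after the rewrite):

cluster misspelled using test_scanid;
analyze misspelled;

Note that CLUSTER is a one-time operation: rows inserted afterwards are not kept in index order, so as the table keeps growing you would need to re-run it periodically, for example during a maintenance window.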