PostgreSQL Performance – How to Get Rid of Bitmap Heap Scan with Proper Indices

Tags: index, performance, postgresql, query-performance

I am running PostgreSQL 9.6. These are the relevant definitions:

CREATE TABLE IF NOT EXISTS instagram.profiles_1000 (
    id                          SERIAL PRIMARY KEY,
    username                    VARCHAR(255) NOT NULL UNIQUE,
    followers                   BIGINT,
    tsv                         TSVECTOR
    -- plus full_name, biography, location_city, location_region,
    -- and location_country, which are referenced by the trigger below
);

CREATE UNIQUE INDEX IF NOT EXISTS instagram_username_index
    ON instagram.profiles_1000(username);
CREATE INDEX IF NOT EXISTS instagram_followers_index
    ON instagram.profiles_1000(followers);
CREATE INDEX IF NOT EXISTS instagram_textsearch_index
    ON instagram.profiles_1000 USING GIN(tsv);

And the text vector is updated by a trigger:

CREATE FUNCTION instagram_documents_search_trigger() RETURNS trigger AS $$
begin
  -- tsvectors are concatenated with || directly;
  -- no textual separator is needed between the parts
  new.tsv :=
        setweight(to_tsvector(COALESCE(new.username, '')), 'D') ||
        setweight(to_tsvector(COALESCE(new.full_name, '')), 'C') ||
        setweight(to_tsvector(COALESCE(new.location_country, '')), 'B') ||
        setweight(to_tsvector(COALESCE(new.location_region, '')), 'B') ||
        setweight(to_tsvector(COALESCE(new.biography, '')), 'A') ||
        setweight(to_tsvector(COALESCE(new.location_city, '')), 'A');
  return new;
end
$$ LANGUAGE plpgsql;


CREATE TRIGGER instagram_tsvectorupdate BEFORE INSERT OR UPDATE
    ON instagram.profiles_1000 FOR EACH ROW
    EXECUTE PROCEDURE instagram_documents_search_trigger();
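Note that a BEFORE INSERT OR UPDATE trigger only populates tsv going forward; rows that already exist keep a NULL tsv until they are next updated. A sketch of a backfill (this rewrites every row, so it can be slow on a table this size):

```sql
-- Touch every row so the BEFORE UPDATE trigger recomputes tsv.
-- On a large table, consider batching by id range to keep
-- transactions and locks short.
UPDATE instagram.profiles_1000 SET username = username;
```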

This is the query:

select instagram.profiles_1000.*, categories, followers as rank                                                                                            
from instagram.profiles_1000
join plainto_tsquery('arts') as q on q @@ tsv
left outer join instagram.profile_categories_agg on instagram.profiles_1000.username = instagram.profile_categories_agg.username
where followers is not null and followers > 0
order by (followers, -id) desc
limit 50;

This is the output of EXPLAIN (ANALYZE, BUFFERS):

https://explain.depesz.com/s/ceCd

The culprit is the Bitmap Heap Scan, which accounts for the bulk of the total execution time. Frankly, I don't understand why it is needed, especially since the Bitmap Index Scan on instagram_textsearch_index already filters the rows by the search term.

Can someone shed some light?

EDIT: It was pointed out that I had misread the EXPLAIN output. Indeed, the left outer join was taking most of the time. I tried removing it as follows:

select instagram.profiles_1000.*, followers as rank
from instagram.profiles_1000
join plainto_tsquery('arts') as q on q @@ tsv                                              
where followers is not null and followers > 0
order by (followers, -id) desc
limit 50;

But the query still takes 13 seconds! This is the EXPLAIN (ANALYZE, BUFFERS) output:

https://explain.depesz.com/s/awfH

Now the bottleneck seems to be the full-text search. Is it really that slow? The table has just 5 million rows, and the tsv column (of type TSVECTOR) is indexed as follows:

CREATE INDEX IF NOT EXISTS instagram_textsearch_index_1000
    ON instagram.profiles_1000 USING GIN(tsv);

EDIT 2: I realized that I can write a leaner query if I only post-process the profiles that match the search (of which there are at most 50). Using this query:

select p.*, categories
from
    (select id
    from instagram.profiles_1000, plainto_tsquery('arts') as q
    where q @@ tsv and followers is not null and followers > 0
    order by (followers, -id) desc
    limit 50) as ids
inner join instagram.profiles_1000 as p on
    p.id = ids.id
left outer join instagram.profile_categories_agg as c on
    c.username = p.username;

I am able to obtain this result:
https://explain.depesz.com/s/OvG

This brings the search down to ~3 seconds. It would be nice to get it under 1 second, though.

Best Answer

If you want to improve the timing further, your best bet might be to abandon the FTS index, at least for cases where the @@ match criterion returns a lot of results.

First, change your ORDER BY from order by (followers, -id) desc to order by followers desc, id. This version is semantically equivalent (except perhaps in how it handles NULL values), but it does not have to package the two columns into a pseudo-row and then sort those row values; it sorts on the column values directly. This direct sorting is faster, but more importantly it opens up the possibility of using an index, rather than a sort, to satisfy the ORDER BY.
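Applied to the simplified query from the first edit, the rewrite looks like this (same result set, modulo NULL ordering):

```sql
select instagram.profiles_1000.*, followers as rank
from instagram.profiles_1000
join plainto_tsquery('arts') as q on q @@ tsv
where followers is not null and followers > 0
order by followers desc, id
limit 50;
```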

Then if you create an index on (followers desc, id), your query can step through that index looking for rows that satisfy the @@ condition, stopping once it finds 50 of them. Doing it this way could be much faster than pulling out over 100,000 rows that are @@ matches and sorting them to pull out the top 50.
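The suggested index would look like this (the index name is illustrative):

```sql
-- Matches the rewritten ORDER BY, so the planner can walk the
-- index in order and stop after the first 50 @@ matches.
CREATE INDEX IF NOT EXISTS instagram_followers_desc_id_index
    ON instagram.profiles_1000 (followers DESC, id);
```

Since the query also filters on followers > 0, a partial index (adding WHERE followers > 0) could shrink the index further, though that is an extra refinement the approach does not require.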