PostgreSQL Count – Count Approximation with Where Clause

Tags: count, index, performance, postgresql, postgresql-9.4

I have the following problem: we have a table of facilities for apartments that looks something like this:

[screenshot of the facility table: id, apartment_id, crawled_name, facility_characteristic]

I would like to perform the following query:

SELECT crawled_name, count(f.id) AS count FROM facility f
WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
GROUP BY f.crawled_name, apartment_id
  having count(apartment_id) > 10000
LIMIT 10;

After the table reached 100 million rows, this query became slow. I already tried creating an index on facility_characteristic, but since the WHERE clause matches more than 5-10% of the rows, Postgres falls back to a sequential scan.

I found this answer about using approximations to speed up counts in PostgreSQL:
Speeding up a GROUP BY, HAVING COUNT query
I tried rewriting my query to look like this:

SELECT DISTINCT crawled_name, a.estimate_ct
FROM facility f
INNER JOIN (
    SELECT v."crawled_name" AS name, (c.reltuples * freq)::int AS estimate_ct
    FROM   pg_stats s
    CROSS  JOIN LATERAL
           unnest(s.most_common_vals::text::text[]  -- use your actual data type
                , s.most_common_freqs) WITH ORDINALITY v ("crawled_name", freq, ord)
    CROSS  JOIN (
           SELECT reltuples FROM pg_class
           WHERE oid = regclass 'facility'
         ) c
    WHERE  schemaname = 'public'
    AND    tablename  = 'facility'
    AND    attname    = 'crawled_name'  -- case sensitive
    ORDER  BY v.ord
    LIMIT  100
) AS a ON f.crawled_name = a.name
WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
ORDER BY a.estimate_ct DESC;

This query is faster, but still not fast enough. Can someone give me pointers to speed it up further?

Results of

EXPLAIN (ANALYZE, BUFFERS) SELECT crawled_name, count(f.id) AS count FROM facility f
WHERE (f.facility_characteristic IS NULL OR f.facility_characteristic = '')
GROUP BY f.crawled_name, apartment_id
  having count(apartment_id) > 10000
LIMIT 10;

are

Limit  (cost=11632509.62..11632511.00 rows=10 width=28) (actual time=1648959.720..1648959.720 rows=0 loops=1)
  Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
  ->  GroupAggregate  (cost=11632509.62..12224422.47 rows=4304821 width=28) (actual time=1648959.718..1648959.718 rows=0 loops=1)
        Group Key: crawled_name, apartment_id
        Filter: (count(apartment_id) > 10000)
        Rows Removed by Filter: 27660633
        Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
        ->  Sort  (cost=11632509.62..11740130.14 rows=43048207 width=28) (actual time=1341268.470..1633080.886 rows=39997609 loops=1)
              Sort Key: crawled_name, apartment_id
              Sort Method: external merge  Disk: 1679168kB
              Buffers: shared hit=82720 read=1625330 dirtied=191588 written=64279, temp read=750859 written=750859
              ->  Seq Scan on facility f  (cost=0.00..3084227.90 rows=43048207 width=28) (actual time=0.026..106099.542 rows=39997609 loops=1)
                    Filter: ((facility_characteristic IS NULL) OR (facility_characteristic = ''::text))
                    Rows Removed by Filter: 63180551
                    Buffers: shared hit=82712 read=1625330 dirtied=191588 written=64279
Planning time: 0.787 ms
Execution time: 1649266.193 ms

EXPLAIN ANALYZE output after applying the answer from ziggy:

Limit  (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.144..36064.183 rows=72 loops=1)
  ->  Sort  (cost=886570.55..886570.57 rows=9 width=12) (actual time=36064.142..36064.154 rows=72 loops=1)
        Sort Key: (count(*))
        Sort Method: quicksort  Memory: 30kB
        ->  HashAggregate  (cost=886570.29..886570.41 rows=9 width=12) (actual time=36053.920..36064.093 rows=72 loops=1)
              Group Key: crawled_name
              Filter: (count(*) > 100000)
              Rows Removed by Filter: 57147
              ->  Bitmap Heap Scan on facility  (cost=12260.48..882687.80 rows=517666 width=12) (actual time=2209.363..22588.928 rows=40050167 loops=1)
                    Recheck Cond: ((crawled_name IS NOT NULL) AND (NULLIF(facility_characteristic, ''::text) IS NULL))
                    Rows Removed by Index Recheck: 57509124
                    Heap Blocks: exact=33902 lossy=950736
                    ->  Bitmap Index Scan on facility_crawled_name_idx  (cost=0.00..12131.06 rows=517666 width=0) (actual time=2201.781..2201.781 rows=40193222 loops=1)
                          Index Cond: (crawled_name IS NOT NULL)
Planning time: 0.103 ms
Execution time: 36065.583 ms

Best Answer

How many unique values do you have for crawled_name? How feasible is it that you'll hit an apartment_id value greater than 2 billion (in the next 5-10 years)?
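Both questions can be checked directly against the table (a quick sketch; the column names are taken from the queries in the question):

```sql
-- Roughly how many distinct crawled_name values exist, and how close
-- apartment_id is to the int4 limit (2,147,483,647):
SELECT count(DISTINCT crawled_name) AS distinct_names,
       max(apartment_id)            AS max_apartment_id
FROM facility;
```

On a 100-million-row table this scan takes a while, but it only needs to run once to decide whether the enum and int4 tricks below are safe.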

Note that COUNT(id) is the same as COUNT(*) (because id is not nullable), and that COUNT(apartment_id) > 10000 is effectively COUNT(*) > 10000 combined with WHERE apartment_id IS NOT NULL. Using COUNT(id) not only makes the query harder to understand, it may also lead PostgreSQL to pick a sub-optimal plan. With COUNT(*), PostgreSQL can compute the whole aggregate from an index that does not include id, without reading the table at all, so it should run much faster.
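In other words, the query from the question can be rewritten equivalently like this (a sketch based on the original query):

```sql
-- Equivalent to the original, but lets the planner satisfy the
-- aggregate from an index that does not contain the id column.
SELECT crawled_name, count(*) AS count
FROM facility
WHERE (facility_characteristic IS NULL OR facility_characteristic = '')
  AND apartment_id IS NOT NULL
GROUP BY crawled_name, apartment_id
HAVING count(*) > 10000
LIMIT 10;
```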

Let's assume you currently have apartment_id values in the low millions, and it is unlikely you'll hit the value 2 billion in the near future. You could speed it up by creating an index and adjusting your query as follows:

CREATE INDEX ON facility (crawled_name, ((apartment_id)::int))
  WHERE NULLIF(facility_characteristic, '') IS NULL
        AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL;

SELECT crawled_name, COUNT(*)
FROM facility
WHERE NULLIF(facility_characteristic, '') IS NULL
      AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL
GROUP BY crawled_name, apartment_id::int
HAVING COUNT(*) > 10000
LIMIT 10;
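To confirm the partial index is actually being picked up, you can run the query through EXPLAIN (the index name is whatever PostgreSQL auto-generated at creation time):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT crawled_name, COUNT(*)
FROM facility
WHERE NULLIF(facility_characteristic, '') IS NULL
      AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL
GROUP BY crawled_name, apartment_id::int
HAVING COUNT(*) > 10000
LIMIT 10;
-- Look for an Index Only Scan on the partial index in the plan;
-- a Seq Scan means the predicate or expression did not match the index.
```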

You could speed it up significantly further if you have a fixed list of crawled_name values numbering in the dozens (or low hundreds), by turning the column's type into an enum. You'll have to drop any indexes that use this column first:

DO $do$ DECLARE _l text[]; BEGIN
  _l := ARRAY(SELECT DISTINCT crawled_name FROM facility WHERE crawled_name > '' ORDER BY 1);
  EXECUTE $$CREATE TYPE crawled_name_enum AS enum ('$$ ||
            array_to_string(_l, $$','$$) || $$');$$;

  ALTER TABLE facility ALTER COLUMN crawled_name
    TYPE crawled_name_enum USING (crawled_name::crawled_name_enum);
END; $do$;
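After the conversion, recreate the partial index on the converted column; this is a sketch reusing the index definition from above:

```sql
-- crawled_name is now a 4-byte enum instead of text,
-- so the index becomes considerably smaller and faster to scan.
CREATE INDEX ON facility (crawled_name, ((apartment_id)::int))
  WHERE NULLIF(facility_characteristic, '') IS NULL
        AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL;
```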