PostgreSQL 10.4
I have a table that already has an index on the date
column. The current query plan does a Bitmap Heap Scan using that existing date index. I would like to add a new index for this query (no parameters are injected into the query). I started with a partial index on the status
column, but I don't know how to handle the GROUP BY and ORDER BY part.
select date, hour, sum(installs) as installs, sum(clicks) as clicks
from ho_aggregated_stats
where date > (current_date - interval '2 day')
and (status='approved' or status is null)
group by date, hour
order by date, hour;
explain https://explain.depesz.com/s/rnCW
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=992433.95..992442.13 rows=1488 width=24) (actual time=3903.296..3903.337 rows=43 loops=1)
   Group Key: date, hour
   Buffers: shared hit=85314 read=11496
   I/O Timings: read=2896.216
   ->  Sort  (cost=992433.95..992434.69 rows=1488 width=24) (actual time=3903.290..3903.298 rows=86 loops=1)
         Sort Key: date, hour
         Sort Method: quicksort  Memory: 31kB
         Buffers: shared hit=85314 read=11496
         I/O Timings: read=2896.216
         ->  Gather  (cost=992265.00..992418.27 rows=1488 width=24) (actual time=3903.167..3903.233 rows=86 loops=1)
               Workers Planned: 1
               Workers Launched: 1
               Buffers: shared hit=85314 read=11496
               I/O Timings: read=2896.216
               ->  Partial HashAggregate  (cost=991265.00..991269.47 rows=1488 width=24) (actual time=3899.779..3899.808 rows=43 loops=2)
                     Group Key: date, hour
                     Buffers: shared hit=149987 read=16557
                     I/O Timings: read=4694.060
                     ->  Parallel Bitmap Heap Scan on ho_aggregated_stats  (cost=21995.80..990158.35 rows=553327 width=16) (actual time=1232.325..3623.710 rows=592709 loops=2)
                           Recheck Cond: (date > (CURRENT_DATE - '2 days'::interval))
                           Filter: (((status)::text = 'approved'::text) OR (status IS NULL))
                           Rows Removed by Filter: 3946
                           Heap Blocks: exact=91807
                           Buffers: shared hit=149987 read=16557
                           I/O Timings: read=4694.060
                           ->  Bitmap Index Scan on index_ho_aggregated_stats_on_date  (cost=0.00..21948.76 rows=1160433 width=0) (actual time=1194.685..1194.685 rows=1339010 loops=1)
                                 Index Cond: (date > (CURRENT_DATE - '2 days'::interval))
                                 Buffers: shared read=5003
                                 I/O Timings: read=1082.452
 Planning time: 0.611 ms
 Execution time: 3948.178 ms
Table schema:
CREATE TABLE public.stats (
id bigserial NOT NULL,
"date" date NOT NULL,
"hour" int4 NOT NULL,
status varchar NULL,
installs int4 NULL DEFAULT 0,
clicks int4 NULL DEFAULT 0,
CONSTRAINT stats_pkey PRIMARY KEY (id)
);
CREATE INDEX index_stats_on_date ON public.stats USING btree (date);
Estimate row count: ~40M
Update:
I checked the distribution of the status column: 75% of rows are null, 20% approved, and 5% rejected, so I'm thinking an index on status is not necessary.
Best Answer
You provided crucial information in a later comment:
This allows for radically different, probably much faster queries:
And:
As opposed to your original, this will also detect hours where there are no rows at all, which should be the most worrying case to begin with (I guess).
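The rewritten queries themselves were not preserved here, but based on the description that follows (generate_series() plus a NOT EXISTS anti-join that flags hours with no qualifying rows), a query in that style might look like this sketch. The exact shape is an assumption; table and column names are taken from the question:

```sql
-- Sketch (assumption): instead of aggregating all qualifying rows,
-- report (date, hour) slots in the last 2 days that have no row at all.
SELECT d::date AS date, h AS hour
FROM   generate_series(current_date - 1, current_date, interval '1 day') d
CROSS  JOIN generate_series(0, 23) h
WHERE  NOT EXISTS (
   SELECT FROM ho_aggregated_stats s
   WHERE  s.date = d::date
   AND    s.hour = h
   AND    (s.status = 'approved' OR s.status IS NULL)
   )
ORDER  BY 1, 2;
```

Each (date, hour) pair returned is an hour with no rows at all, the case the answer calls out as most worrying; the NOT EXISTS subquery can stop at the first matching row per hour.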
About generate_series():

Aside: there is a corner-case bug lurking in your query: the current date depends on the time zone setting of your current session. So the query might give different (possibly misleading) results depending on where you run it. It's generally best to operate with timestamptz to avoid any such complications. You might replace (date, hour) with a single timestamptz column. Same size. It's cheap to derive date / hour etc. from it.

Index
This should be massively faster even with a plain index on (date, hour), thus keeping the cost for the index itself low. The semi-anti-join resulting from NOT EXISTS can scan the index, discard each hour as soon as the first matching row is found, and look no further. No need to aggregate all qualifying rows like your original did.

I suggest replacing the index you have, index_ho_aggregated_stats_on_date on just (date), with one on (date, hour), or (date DESC, hour DESC); it hardly matters for this case. It's exactly the same size as your old index, since date + integer occupy 8 bytes together. And it practically does everything your old index did, plus more. See:

If and only if a majority of rows fails the additional condition WHERE (status = 'approved' OR status IS NULL), then it might make sense to add a partial index with that condition. Else it's cheaper not to create another index and let Postgres add a FILTER step to the index scan.

If your table is huge (around 0.5M rows per day?), it might make sense to cut off the bulk of old data from the index. See:
Or consider a BRIN index if table rows are mostly sorted by (date, hour) physically. Related:

(Probably not worth it if you have an index on (date, hour) anyway.)

Finally, since "Data is being written to the table every 5mins via multiple UPSERT queries", consider a manual VACUUM ANALYZE (or just VACUUM) right after that to allow index-only scans in combination with a partial, multicolumn index covering all involved columns. You'll have to weigh cost and gain of this.