Postgresql – HLL aggregation query is very slow

performancepostgresqlpostgresql-11postgresql-performance

I have the following query

select 
    ts_date, 
    sum(bids_placed), 
    hll_cardinality(hll_union_agg(eligible_uu)) 
from 
    performancedata 
where 
    ts_date>='2019-08-01' 
    and ts_date<='2019-08-31' 
    and licensee_id = 219 
group by 
    ts_date 
order by 
    ts_date asc;

below is it's explain analyze output :

   Sort Key: performancedata20190801ln219.ts_date
   Sort Method: quicksort  Memory: 26kB
   ->  HashAggregate  (cost=14235.72..14236.19 rows=27 width=44) (actual time=1619.576..1621.749 rows=31 loops=1)
         Group Key: performancedata20190801ln219.ts_date
         ->  Append  (cost=0.00..13541.04 rows=92624 width=87) (actual time=0.039..50.379 rows=92624 loops=1)
               ->  Seq Scan on performancedata20190801ln219  (cost=0.00..473.79 rows=3302 width=100) (actual time=0.038..1.520 rows=3302 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190802ln219  (cost=0.00..493.99 rows=3371 width=94) (actual time=0.026..1.551 rows=3371 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190803ln219  (cost=0.00..489.24 rows=3328 width=90) (actual time=0.021..1.558 rows=3328 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190804ln219  (cost=0.00..492.66 rows=3295 width=94) (actual time=0.019..1.547 rows=3295 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190805ln219  (cost=0.00..425.41 rows=3109 width=80) (actual time=0.020..1.427 rows=3109 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190806ln219  (cost=0.00..430.70 rows=3126 width=90) (actual time=0.022..1.425 rows=3126 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190807ln219  (cost=0.00..422.36 rows=3049 width=84) (actual time=0.018..1.396 rows=3049 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190808ln219  (cost=0.00..408.64 rows=2951 width=93) (actual time=0.020..1.357 rows=2951 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190809ln219  (cost=0.00..397.68 rows=2953 width=78) (actual time=0.024..1.351 rows=2953 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190810ln219  (cost=0.00..404.25 rows=2986 width=84) (actual time=0.021..1.331 rows=2986 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190811ln219  (cost=0.00..388.61 rows=2949 width=75) (actual time=0.016..1.321 rows=2949 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190812ln219  (cost=0.00..390.20 rows=2983 width=76) (actual time=0.020..1.313 rows=2983 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190813ln219  (cost=0.00..411.71 rows=2955 width=89) (actual time=0.016..1.320 rows=2955 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190814ln219  (cost=0.00..411.31 rows=2989 width=83) (actual time=0.017..1.359 rows=2989 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190815ln219  (cost=0.00..406.91 rows=2966 width=78) (actual time=0.015..1.351 rows=2966 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190816ln219  (cost=0.00..392.84 rows=2848 width=86) (actual time=0.014..1.299 rows=2848 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190817ln219  (cost=0.00..399.24 rows=2928 width=82) (actual time=0.018..1.311 rows=2928 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190818ln219  (cost=0.00..416.17 rows=2924 width=82) (actual time=0.015..1.328 rows=2924 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190819ln219  (cost=0.00..381.37 rows=2821 width=76) (actual time=0.016..1.443 rows=2821 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190820ln219  (cost=0.00..408.20 rows=2926 width=93) (actual time=0.021..1.351 rows=2926 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190821ln219  (cost=0.00..389.72 rows=2841 width=85) (actual time=0.016..1.302 rows=2841 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190822ln219  (cost=0.00..406.66 rows=2895 width=95) (actual time=0.013..1.330 rows=2895 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190823ln219  (cost=0.00..388.72 rows=2898 width=81) (actual time=0.016..1.330 rows=2898 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190824ln219  (cost=0.00..411.61 rows=2949 width=84) (actual time=0.014..1.350 rows=2949 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190825ln219  (cost=0.00..382.42 rows=2938 width=72) (actual time=0.022..1.332 rows=2938 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190826ln219  (cost=0.00..392.82 rows=2904 width=89) (actual time=0.023..1.326 rows=2904 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190827ln219  (cost=0.00..394.77 rows=2901 width=107) (actual time=0.024..1.355 rows=2901 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190828ln219  (cost=0.00..534.78 rows=2902 width=102) (actual time=0.020..1.444 rows=2902 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190829ln219  (cost=0.00..520.87 rows=2964 width=97) (actual time=0.022..1.539 rows=2964 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190830ln219  (cost=0.00..445.59 rows=2891 width=85) (actual time=0.020..1.416 rows=2891 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
               ->  Seq Scan on performancedata20190831ln219  (cost=0.00..364.69 rows=2782 width=96) (actual time=0.016..1.295 rows=2782 loops=1)
                     Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
 Planning Time: 133.162 ms
 Execution Time: 1622.475 ms
(70 rows)

This query takes 1.6 secs, while if I remove the hll aggregations the query executes in 60 ms, below is the query without hll and it's explain analyze output.

select 
    ts_date, 
    sum(bids_placed) 
from 
    performancedata 
where 
    ts_date>='2019-08-01' 
    and ts_date<='2019-08-31' 
    and licensee_id = 219 
group by 
    ts_date 
order by 
    ts_date asc;
   Group Key: performancedata20190828ln219.ts_date
   ->  Gather Merge  (cost=13797.43..13803.73 rows=54 width=36) (actual time=57.583..59.935 rows=34 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=12797.40..12797.47 rows=27 width=36) (actual time=24.012..24.013 rows=11 loops=3)
               Sort Key: performancedata20190828ln219.ts_date
               Sort Method: quicksort  Memory: 26kB
               Worker 0:  Sort Method: quicksort  Memory: 26kB
               Worker 1:  Sort Method: quicksort  Memory: 25kB
               ->  Partial HashAggregate  (cost=12796.42..12796.76 rows=27 width=36) (actual time=23.986..23.990 rows=11 loops=3)
                     Group Key: performancedata20190828ln219.ts_date
                     ->  Parallel Append  (cost=0.00..12603.45 rows=38594 width=12) (actual time=0.017..18.346 rows=30875 loops=3)
                           ->  Parallel Seq Scan on performancedata20190828ln219  (cost=0.00..513.87 rows=1707 width=12) (actual time=0.007..1.313 row
s=2902 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190829ln219  (cost=0.00..499.51 rows=1744 width=12) (actual time=0.006..1.357 row
s=2964 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190802ln219  (cost=0.00..469.70 rows=1983 width=12) (actual time=0.007..1.318 row
s=3371 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190804ln219  (cost=0.00..468.92 rows=1938 width=12) (actual time=0.003..1.306 row
s=3295 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190803ln219  (cost=0.00..465.26 rows=1958 width=12) (actual time=0.004..1.271 row
s=3328 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190801ln219  (cost=0.00..449.99 rows=1942 width=12) (actual time=0.034..1.902 row
s=3302 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190830ln219  (cost=0.00..424.76 rows=1701 width=12) (actual time=0.003..1.199 row
s=2891 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190806ln219  (cost=0.00..408.18 rows=1839 width=12) (actual time=0.017..1.153 row
s=3126 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190805ln219  (cost=0.00..403.00 rows=1829 width=12) (actual time=0.007..1.761 row
s=3109 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190807ln219  (cost=0.00..400.39 rows=1794 width=12) (actual time=0.003..1.157 row
s=3049 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190818ln219  (cost=0.00..395.10 rows=1720 width=12) (actual time=0.007..1.127 row
s=2924 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190813ln219  (cost=0.00..390.42 rows=1738 width=12) (actual time=0.007..1.692 row
s=2955 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190824ln219  (cost=0.00..390.36 rows=1735 width=12) (actual time=0.006..1.174 row
s=2949 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190814ln219  (cost=0.00..389.77 rows=1758 width=12) (actual time=0.007..0.563 row
s=996 loops=3)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190808ln219  (cost=0.00..387.38 rows=1736 width=12) (actual time=0.003..1.091 row
s=2951 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190820ln219  (cost=0.00..387.12 rows=1721 width=12) (actual time=0.007..0.790 row
s=1463 loops=2)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190822ln219  (cost=0.00..385.80 rows=1703 width=12) (actual time=0.005..1.626 row
s=2895 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190815ln219  (cost=0.00..385.53 rows=1745 width=12) (actual time=0.007..1.615 row
s=2966 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190810ln219  (cost=0.00..382.74 rows=1756 width=12) (actual time=0.008..1.638 row
s=2986 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190817ln219  (cost=0.00..378.14 rows=1722 width=12) (actual time=0.009..1.649 row
s=2928 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190809ln219  (cost=0.00..376.40 rows=1737 width=12) (actual time=0.007..1.672 row
s=2953 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190827ln219  (cost=0.00..373.86 rows=1706 width=12) (actual time=0.006..1.719 row
s=2901 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190816ln219  (cost=0.00..372.32 rows=1675 width=12) (actual time=0.006..1.642 row
s=2848 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190826ln219  (cost=0.00..371.89 rows=1708 width=12) (actual time=0.008..1.703 row
s=2904 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190821ln219  (cost=0.00..369.25 rows=1671 width=12) (actual time=0.006..1.668 row
s=2841 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190812ln219  (cost=0.00..368.71 rows=1755 width=12) (actual time=0.010..1.710 row
s=2983 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190823ln219  (cost=0.00..367.83 rows=1705 width=12) (actual time=0.007..1.701 row
s=2898 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190811ln219  (cost=0.00..367.36 rows=1735 width=12) (actual time=0.006..1.725 row
s=2949 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190825ln219  (cost=0.00..361.24 rows=1728 width=12) (actual time=0.006..1.722 row
s=2938 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190819ln219  (cost=0.00..361.04 rows=1659 width=12) (actual time=0.007..1.667 row
s=2821 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
                           ->  Parallel Seq Scan on performancedata20190831ln219  (cost=0.00..344.64 rows=1636 width=12) (actual time=0.009..1.399 row
s=2782 loops=1)
                                 Filter: ((ts_date >= '2019-08-01'::date) AND (ts_date <= '2019-08-31'::date) AND (licensee_id = 219))
 Planning Time: 135.839 ms
 Execution Time: 60.662 ms
(77 rows)

If we do not have hll agggregation then it is executing query parallely, while if hll aggregation is present then it does sequential scan and does HashAggregate which takes the most of the time. Is there any way to improve this query performance?

Below are my postgresql server configurations :

PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

shared_buffers : 3936960kB
work_mem : 4 MB
effective_cache_size : 12GB

Postgresql server is having 4 cores, 16 GB RAM & 100 GB SSD.

Any suggestions are welcome. I have done VACUUM & ANALYZE operations, but still performance did not improve, have also played with work_mem by setting till 32 MB and it only improved things by 6-7 ms after that the performance degrades.

Will keeping the table in memory help and if so how should I go about it. Please let me know if you need further details.

Best Answer

That would mean that at least one of hll_cardinality and hll_union_agg must not be parallel safe.

Try

\df+ hll_cardinality
\df+ hll_union_agg

in psql and see what it says under Parallel.

If the functions are parallel safe, you can use ALTER TABLE to mark them as such. If in doubt, ask the author of the functions.

In addition, the aggregate function must have COMBINEFUNC defined. Check with

SELECT f.proparallel,
       a.aggcombinefn
FROM pg_aggregate AS a
   JOIN pg_proc AS f ON f.oid = a.aggfnoid
WHERE f.proname = 'hll_union_agg';