Postgresql – Need help in optimizing a slow Postgres Query

optimizationpostgresqlquery-performance

So, I need to calculate a rolling Monthly active users on everyday for last 100 days. In output, one column should represent the dates of last 100 days from today and against every single day, number of distinct values of users who occur in the activity table in past 30 days from that day should be calculated.

The query that i have written is returning the desired output. But is too slow. Infact doesn't work for 100 days.

Need Help with optimization !!

Actual Postgresql Code #

Select A.Daily,count(distinct da.user) as Users
from 
    (
        select 
            date(d) as Daily
        from generate_series(current_date - Interval '10' DAY, current_date - Interval '1' DAY, '1 day') d 
        group by 1
        order by 1
    ) as A 
join public.downloads_activity da 
on date_trunc('day',da.created_at) between A.Daily - Interval '30' DAY and A.Daily - Interval '1' DAY
group by 1
order by 1

Expected and Returned Result

      Daily                  Users
January 12, 2020, 12:00 AM    35,654
January 13, 2020, 12:00 AM    35,465
January 14, 2020, 12:00 AM    35,777
January 15, 2020, 12:00 AM    35,980
January 16, 2020, 12:00 AM    35,953
January 17, 2020, 12:00 AM    35,899
January 18, 2020, 12:00 AM    35,829
January 19, 2020, 12:00 AM    35,627
January 20, 2020, 12:00 AM    35,348
January 21, 2020, 12:00 AM    35,564

Query Plan

GroupAggregate (cost=25.16..70992118.48 rows=200 width=26)
  Group Key: (date(d.d))
  -> Nested Loop (cost=25.16..69869352.15 rows=224552867 width=26)
       Join Filter: ((date_trunc('day'::text, ast_d.created_at) >= ((date(d.d)) - '30 days'::interval day)) AND (date_trunc('day'::text, ast_d.created_at) <= ((date(d.d)) - '1 day'::interval day)))
       -> Sort (cost=25.16..25.66 rows=200 width=8)
            Sort Key: (date(d.d))
            -> HashAggregate (cost=15.02..17.52 rows=200 width=8)
                 Group Key: date(d.d)
                 -> Function Scan on generate_series d (cost=0.02..12.52 rows=1000 width=8)
       -> Materialize (cost=0.00..571429.19 rows=10104879 width=30)
            -> Seq Scan on asset_downloads ast_d (cost=0.00..451827.79 rows=10104879 width=30)

Query Plan With Analyze Buffers

GroupAggregate (cost=25.16..71011891.94 rows=200 width=26) (actual time=24608.996..182045.961 rows=10 loops=1)
  Group Key: (date(d.d))
  Buffers: shared hit=325977 read=24911 dirtied=28, temp read=552011 written=118170
  -> Nested Loop (cost=25.16..69888811.94 rows=224615600 width=26) (actual time=2213.459..93364.594 rows=16514694 loops=1)
       Join Filter: ((date_trunc('day'::text, ast_d.created_at) >= ((date(d.d)) - '30 days'::interval day)) AND (date_trunc('day'::text, ast_d.created_at) <= ((date(d.d)) - '1 day'::interval day)))
       Rows Removed by Join Filter: 84818776
       Buffers: shared hit=325969 read=24911 dirtied=28, temp read=488070 written=54229
       -> Sort (cost=25.16..25.66 rows=200 width=8) (actual time=0.081..0.093 rows=10 loops=1)
            Sort Key: (date(d.d))
            Sort Method: quicksort Memory: 25kB
            Buffers: shared hit=3
              -> HashAggregate (cost=15.02..17.52 rows=200 width=8) (actual time=0.044..0.052 rows=10 loops=1)
                 Group Key: date(d.d)
                 -> Function Scan on generate_series d (cost=0.02..12.52 rows=1000 width=8) (actual time=0.028..0.033 rows=10 loops=1)
       -> Materialize (cost=0.00..571588.53 rows=10107702 width=30) (actual time=0.006..3572.397 rows=10133347 loops=10)
            Buffers: shared hit=325966 read=24911 dirtied=28, temp read=488070 written=54229
            -> Seq Scan on asset_downloads ast_d (cost=0.00..451954.02 rows=10107702 width=30) (actual time=0.016..3797.284 rows=10133347 loops=1)
                 Buffers: shared hit=325966 read=24911 dirtied=28
Planning time: 8.678 ms
Execution time: 182102.426 ms

Best Answer

Select A.Daily,
       count(distinct da.user) as Users
from 
    (
        -- calculate all constants needed once
        select date(d) as Daily,
               date(d) - Interval '30' DAY as DailyStart,
               date(d) - Interval '1' DAY as DailyEnd
        from generate_series(current_date - Interval '10' DAY, current_date - Interval '1' DAY, '1 day') d 
    -- GROUP BY by unique value makes no sense
    -- ORDER BY in subquery is ignored
    ) as A 
                                 -- avoid fn(table.field)
join public.downloads_activity da on da.created_at >= A.DailyStart 
                                 and da.created_at < A.DailyEnd
group by 1
order by 1