Postgresql – Most efficient way to query a date range in Postgresql

Tags: date, index, postgresql

I have a table with a timestamp with time zone column. I want to perform a count / group by query for rows within a certain date range:

select count(1), url 
from app_pageview 
where viewed_at between '2019-01-01' and '2020-01-01'
group by 2
order by 1 desc 
limit 10;

viewed_at has a btree index on it, but looking at the EXPLAIN output, the index doesn't appear to be used:

postgres=# explain select count(1), url 
from app_pageview 
where viewed_at between '2019-01-01' and '2019-01-31'
group by 2
order by 1 desc 
limit 10;
                                                                                       QUERY PLAN                                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2173025.08..2173025.10 rows=10 width=32)
   ->  Sort  (cost=2173025.08..2173025.33 rows=101 width=32)
         Sort Key: (count(1)) DESC
         ->  Finalize GroupAggregate  (cost=2172997.31..2173022.89 rows=101 width=32)
               Group Key: url
               ->  Gather Merge  (cost=2172997.31..2173020.87 rows=202 width=32)
                     Workers Planned: 2
                     ->  Sort  (cost=2171997.28..2171997.53 rows=101 width=32)
                           Sort Key: url
                           ->  Partial HashAggregate  (cost=2171992.91..2171993.92 rows=101 width=32)
                                 Group Key: url
                                 ->  Parallel Seq Scan on app_pageview  (cost=0.00..2163462.00 rows=1706182 width=24)
                                       Filter: ((viewed_at >= '2019-01-01 00:00:00+00'::timestamp with time zone) AND (viewed_at <= '2019-01-31 00:00:00+00'::timestamp with time zone))
 JIT:
   Functions: 13
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(16 rows)

I have generated ~100M rows of dummy data to test this out.

How can I make it more efficient?

Would storing the viewed_at field as two separate fields (date and time) be of any use?

More details for a smaller query, honing in on a subset of the rows (where site_id = '...'):

The dummy data generated was 1M rows per site_id, with 100 different site_id values. site_id is a btree-indexed FK.

postgres=# explain select count(1), url from app_pageview where site_id = '14efbc9e-332c-11ea-bbe1-0242ac130002' and  viewed_at > '2019-01-01' and viewed_at < '2019-01-31' group by 2 order by 1 desc limit 10;
                                                                                      QUERY PLAN                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=39877.45..39877.48 rows=10 width=32)
   ->  Sort  (cost=39877.45..39877.71 rows=101 width=32)
         Sort Key: (count(1)) DESC
         ->  Finalize GroupAggregate  (cost=39849.68..39875.27 rows=101 width=32)
               Group Key: url
               ->  Gather Merge  (cost=39849.68..39873.25 rows=202 width=32)
                     Workers Planned: 2
                     ->  Sort  (cost=38849.66..38849.91 rows=101 width=32)
                           Sort Key: url
                           ->  Partial HashAggregate  (cost=38845.29..38846.30 rows=101 width=32)
                                 Group Key: url
                                 ->  Parallel Index Scan using app_pageview_site_id_123e6520 on app_pageview  (cost=0.57..38771.07 rows=14844 width=24)
                                       Index Cond: (site_id = '14efbc9e-332c-11ea-bbe1-0242ac130002'::uuid)
                                       Filter: ((viewed_at > '2019-01-01 00:00:00+00'::timestamp with time zone) AND (viewed_at < '2019-01-31 00:00:00+00'::timestamp with time zone))
(14 rows)

table description:

postgres=# \d+ app_pageview
                                                 Table "public.app_pageview"
  Column   |           Type           | Collation | Nullable |                 Default                  | Storage  | Stats target | Description 
-----------+--------------------------+-----------+----------+------------------------------------------+----------+--------------+-------------
 id        | integer                  |           | not null | nextval('app_pageview_id_seq'::regclass) | plain    |              | 
 url       | character varying(200)   |           | not null |                                          | extended |              | 
 referrer  | character varying(256)   |           | not null |                                          | extended |              | 
 viewed_at | timestamp with time zone |           | not null |                                          | plain    |              | 
 site_id   | uuid                     |           | not null |                                          | plain    |              | 
 view_hash | character varying(256)   |           |          |                                          | extended |              | 
 unique    | boolean                  |           | not null |                                          | plain    |              | 
Indexes:
    "app_pageview_pkey" PRIMARY KEY, btree (id)
    "app_pageview_site_id_123e6520" btree (site_id)
    "app_pageview_view_hash_d570ced8" btree (view_hash)
    "app_pageview_view_hash_d570ced8_like" btree (view_hash varchar_pattern_ops)
    "app_pageview_viewed_at_84306743" btree (viewed_at)
    "app_pageview_viewed_at_inverse" btree (viewed_at DESC)
Foreign-key constraints:
    "app_pageview_site_id_123e6520_fk_app_site_id" FOREIGN KEY (site_id) REFERENCES app_site(id) DEFERRABLE INITIALLY DEFERRED
Access method: heap

Best Answer

What is order by 1? It orders by the first column in the select list, here count(1). I think combining that with limit 10 results in Postgres not knowing which 10 rows to return, so it first computes all of them and then returns 10. However, this theoretically shouldn't slow down the query.
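For reference, group by 2 and order by 1 refer to select-list columns by position; the query in the question is equivalent to spelling the columns out:

```sql
-- Equivalent query with the positional references written explicitly:
select count(1), url
from app_pageview
where viewed_at between '2019-01-01' and '2019-01-31'
group by url             -- "group by 2": the second select-list column
order by count(1) desc   -- "order by 1": the first select-list column
limit 10;
```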

But do try:

explain select count(*)
from app_pageview
where viewed_at between '2019-01-01' and '2019-01-31';

and see if that simplifies the situation.

For the query with site_id, you may benefit from a multicolumn index on (site_id, viewed_at), although Postgres is sometimes able to combine single-column indexes on its own. Get the single-column query working well first and then experiment with this.
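As a sketch (the index name is just an example), the multicolumn index would look like:

```sql
-- Multicolumn index: the planner can apply the equality on site_id
-- and the range condition on viewed_at within a single index scan.
create index app_pageview_site_id_viewed_at
    on app_pageview (site_id, viewed_at);
```

With this in place, the viewed_at range should move from the Filter line into the Index Cond line of the plan, so rows outside the date range are never fetched from the heap at all.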