Postgresql – Cumulate count between date with 0 when there are records for a given day

postgresql

How could one write a Postgresql query that gives both daily and cumulative counts without gaps? E.g. if there is no data for a given day, it would show 0 as a daily count for that that day and the same cumulative sum as the previous day?

I think I'd need GENERATE_SERIES, but I don't know how to do it. I'm also not entirely sure if order by day asc rows between unbounded preceding and current row would work correctly always, but maybe not the biggest issue here.

I have tried to craft the query with PARTITION BY and

Let's say I write a table and add data such as

create table test
(
    id int4 NOT NULL GENERATED ALWAYS AS IDENTITY,  
    data int4,  
    created_at timestamptz NOT NULL DEFAULT now()
);

insert into test(data, created_at) values(1, '2021-04-01');
insert into test(data, created_at) values(2, '2021-04-01');
insert into test(data, created_at) values(3, '2021-04-02');
insert into test(data, created_at) values(4, '2021-04-03');
insert into test(data, created_at) values(5, '2021-04-05');
insert into test(data, created_at) values(6, '2021-04-07');

and then create a queries such as

SELECT
  created_at as "Date",
  count(1) as "Daily count"
FROM test
WHERE created_at >= '2021-04-01'
  AND created_at <= '2021-04-30'
GROUP BY 1

giving

Date Daily count
2021-04-01 00:00:00 2
2021-04-02 00:00:00 1
2021-04-03 00:00:00 1
2021-04-05 00:00:00 1
2021-04-07 00:00:00 1
with data as (
  select
    date_trunc('day', created_at) as day,
    count(1)
  from test
  group by 1
)
select
  day,
  running_total(count) over (order by day asc rows between unbounded preceding and current row)
from data
day running_total
2021-04-01 00:00:00 2
2021-04-02 00:00:00 3
2021-04-03 00:00:00 4
2021-04-05 00:00:00 5
2021-04-07 00:00:00 6

But as noted, how could these two be combined without gaps on daily values? Somehow it feels I get close but bump into some (syntax) problem. Maybe those two queries are the simplest and cleanest examples of what I'm thinking.

Best Answer

You write in your question:

I think I'd need GENERATE_SERIES, but I don't know how to do it.

Indeed you do - and you can as follows (all the code below is also given in the fiddle here):

CREATE TABLE cal_tab (cal_date) AS
(
  SELECT  GENERATE_SERIES
  (
    '2021-04-01'::DATE,
    '2021-04-07'::DATE,
    '1 DAY'
  )
);

Just to check:

SELECT cal_date::DATE FROM cal_tab;

Result:

cal_date
2021-04-01
2021-04-02
2021-04-03
...
... snipped for brevity - you determine how many rows there are with the 
... second parameter to GENERATE_SERIES
...

Then, I use your data:

CREATE TABLE test
(
    id         INT4 NOT NULL GENERATED ALWAYS AS IDENTITY,  
    data       INT4,  
    created_at DATE NOT NULL DEFAULT NOW()
);

Populate (slightly modified from the question - so as not to have the id the same as the provided data (i.e. 1...6), - instead, I explicitly inserted hundreds - improves legibility):

INSERT INTO test (data, created_at) VALUES (100, '2021-04-01');
INSERT INTO test (data, created_at) VALUES (100, '2021-04-01');

INSERT INTO test (data, created_at) VALUES (200, '2021-04-02');
INSERT INTO test (data, created_at) VALUES (300, '2021-04-03');
                                                               -- GAP for 4th of April
INSERT INTO test (data, created_at) VALUES (500, '2021-04-05');
                                                               -- GAP for 6th of April
INSERT INTO test (data, created_at) VALUES (600, '2021-04-07');

And then run the following SQL:

SELECT
  DISTINCT  -- try with and without DISTINCT
  ct.cal_date::DATE,
  COALESCE(t.data, 0) AS data, -- t.created_at::DATE,
  
  COUNT(t.created_at) OVER (PARTITION BY t.created_at 
                              ORDER BY ct.cal_date ASC) AS "Cnt/day", 
  COUNT(t.created_at) OVER (ORDER BY ct.cal_date ASC) AS "Cum. cnt/day",
  
  SUM(t.data) OVER (ORDER BY ct.cal_date::DATE ASC) AS "Sum/day"
FROM 
  cal_tab ct
LEFT OUTER JOIN test t
  ON ct.cal_date = t.created_at::DATE
ORDER BY ct.cal_date::DATE ASC;

Result:

  cal_date  data    Cnt/day Cum. cnt/day  Cum. sum/day
2021-04-01  100           2            2           200
2021-04-02  200           1            3           400
2021-04-03  300           1            4           700
2021-04-04    0           0            4           700
2021-04-05  500           1            5          1200
2021-04-06    0           0            5          1200
2021-04-07  600           1            6          1800

A result which, I believe covers all your requested elements.

In this case, you don't have to worry about the FRAME clause (i.e. the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) bit - FRAMEs are a kind of sub-partitioning method - but the long phrase above is actually the default - see here for a good introduction to window frames in PostgreSQL.

You could also (depending on your requirements) use a CTE (Common Table Expression) for your query if you don't wish to create a permanent calendar table (see my answer here for an example of this). (+1 for having made me think!).