Postgresql – Cumulative count between dates with 0 when there are no records for a given day

postgresql

How could one write a PostgreSQL query that gives both daily and cumulative counts without gaps? E.g. if there is no data for a given day, it would show 0 as the daily count for that day and the same cumulative sum as the previous day.

I think I'd need GENERATE_SERIES, but I don't know how to do it. I'm also not entirely sure if order by day asc rows between unbounded preceding and current row would work correctly always, but maybe not the biggest issue here.

I have tried to craft the query with PARTITION BY and

Let's say I write a table and add data such as

create table test
(
    id int4 NOT NULL GENERATED ALWAYS AS IDENTITY,  
    data int4,  
    created_at timestamptz NOT NULL DEFAULT now()
);

insert into test(data, created_at) values(1, '2021-04-01');
insert into test(data, created_at) values(2, '2021-04-01');
insert into test(data, created_at) values(3, '2021-04-02');
insert into test(data, created_at) values(4, '2021-04-03');
insert into test(data, created_at) values(5, '2021-04-05');
insert into test(data, created_at) values(6, '2021-04-07');

and then create queries such as

SELECT
  created_at as "Date",
  count(1) as "Daily count"
FROM test
WHERE created_at >= '2021-04-01'
  AND created_at <= '2021-04-30'
GROUP BY 1

giving

Date	Daily count
2021-04-01 00:00:00	2
2021-04-02 00:00:00	1
2021-04-03 00:00:00	1
2021-04-05 00:00:00	1
2021-04-07 00:00:00	1

with data as (
  select
    date_trunc('day', created_at) as day,
    count(1)
  from test
  group by 1
)
select
  day,
  sum(count) over (order by day asc rows between unbounded preceding and current row) as running_total
from data

day	running_total
2021-04-01 00:00:00	2
2021-04-02 00:00:00	3
2021-04-03 00:00:00	4
2021-04-05 00:00:00	5
2021-04-07 00:00:00	6

But as noted, how could these two be combined without gaps on daily values? Somehow it feels I get close but bump into some (syntax) problem. Maybe those two queries are the simplest and cleanest examples of what I'm thinking.

Best Answer

You write in your question:

I think I'd need GENERATE_SERIES, but I don't know how to do it.

Indeed you do, and you can do it as follows (all the code below is also available in the linked fiddle):

CREATE TABLE cal_tab (cal_date) AS
(
  SELECT  GENERATE_SERIES
  (
    '2021-04-01'::DATE,
    '2021-04-07'::DATE,
    '1 DAY'
  )
);

Just to check:

SELECT cal_date::DATE FROM cal_tab;

Result:

cal_date
2021-04-01
2021-04-02
2021-04-03
...
... snipped for brevity - you determine how many rows there are with the 
... second parameter to GENERATE_SERIES
...

Then, I use your data:

CREATE TABLE test
(
    id         INT4 NOT NULL GENERATED ALWAYS AS IDENTITY,  
    data       INT4,  
    created_at DATE NOT NULL DEFAULT NOW()
);

Populate it (slightly modified from the question: so that the data values are not the same as the id values 1...6, I explicitly inserted hundreds instead, which improves legibility):

INSERT INTO test (data, created_at) VALUES (100, '2021-04-01');
INSERT INTO test (data, created_at) VALUES (100, '2021-04-01');

INSERT INTO test (data, created_at) VALUES (200, '2021-04-02');
INSERT INTO test (data, created_at) VALUES (300, '2021-04-03');
                                                               -- GAP for 4th of April
INSERT INTO test (data, created_at) VALUES (500, '2021-04-05');
                                                               -- GAP for 6th of April
INSERT INTO test (data, created_at) VALUES (600, '2021-04-07');

And then run the following SQL:

SELECT
  DISTINCT  -- try with and without DISTINCT
  ct.cal_date::DATE,
  COALESCE(t.data, 0) AS data, -- t.created_at::DATE,
  
  COUNT(t.created_at) OVER (PARTITION BY t.created_at 
                              ORDER BY ct.cal_date ASC) AS "Cnt/day", 
  COUNT(t.created_at) OVER (ORDER BY ct.cal_date ASC) AS "Cum. cnt/day",
  
  SUM(t.data) OVER (ORDER BY ct.cal_date::DATE ASC) AS "Cum. sum/day"
FROM 
  cal_tab ct
LEFT OUTER JOIN test t
  ON ct.cal_date = t.created_at::DATE
ORDER BY ct.cal_date::DATE ASC;

Result:

  cal_date  data    Cnt/day Cum. cnt/day  Cum. sum/day
2021-04-01  100           2            2           200
2021-04-02  200           1            3           400
2021-04-03  300           1            4           700
2021-04-04    0           0            4           700
2021-04-05  500           1            5          1200
2021-04-06    0           0            5          1200
2021-04-07  600           1            6          1800

A result which, I believe, covers all the elements you requested.

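In this case, you don't have to worry about the FRAME clause (i.e. the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW bit). FRAMEs are a kind of sub-partitioning method, and when a window has an ORDER BY the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is almost the long phrase above. The difference is that RANGE also includes every row that ties with the current row on the ORDER BY value. That is why the two joined rows for 2021-04-01 get identical cumulative values here, which in turn lets DISTINCT collapse them into one line. See the linked article for a good introduction to window frames in PostgreSQL.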
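To see how the frame choice plays out when several rows share a date, here is a minimal sketch that uses an inline VALUES list rather than the tables above (the alias v(d) and the column names are made up for illustration). It compares the default frame with an explicit ROWS frame:

```sql
-- Two rows share 2021-04-01, so they are "peers" under ORDER BY d.
SELECT
  d,
  -- default frame (RANGE ...): counts up to and including all peers
  COUNT(*) OVER (ORDER BY d) AS cnt_default_frame,
  -- explicit ROWS frame: counts strictly row by row
  COUNT(*) OVER (ORDER BY d
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cnt_rows_frame
FROM (VALUES ('2021-04-01'::DATE),
             ('2021-04-01'::DATE),
             ('2021-04-02'::DATE)) AS v(d)
ORDER BY d, cnt_rows_frame;

--     d       cnt_default_frame  cnt_rows_frame
-- 2021-04-01                  2               1
-- 2021-04-01                  2               2
-- 2021-04-02                  3               3
```

With the default frame, both 2021-04-01 rows report 2. With ROWS they report 1 and 2, so a DISTINCT over those columns would no longer merge them.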

You could also (depending on your requirements) use a CTE (Common Table Expression) for your query if you don't wish to create a permanent calendar table (see my answer here for an example of this). (+1 for having made me think!).
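As a rough sketch of that CTE approach (not necessarily the query from the linked answer), the series can be generated inline and the rows aggregated per day before the window function runs, which also removes the need for DISTINCT. The CTE names days and daily and the output aliases are assumptions chosen for illustration; the date range is hard-coded to match the sample data.

```sql
WITH days AS (
  -- one row per calendar day in the range of interest
  SELECT g.d::DATE AS day
  FROM GENERATE_SERIES('2021-04-01'::DATE,
                       '2021-04-07'::DATE,
                       INTERVAL '1 day') AS g(d)
),
daily AS (
  -- rows per day; days without data simply do not appear here
  SELECT created_at::DATE AS day, COUNT(*) AS daily_count
  FROM test
  GROUP BY 1
)
SELECT
  days.day,
  COALESCE(daily.daily_count, 0)                                AS "Daily count",
  SUM(COALESCE(daily.daily_count, 0)) OVER (ORDER BY days.day)  AS "Cumulative count"
FROM days
LEFT JOIN daily ON daily.day = days.day
ORDER BY days.day;
```

With the sample rows this should give daily counts of 2, 1, 1, 0, 1, 0, 1 and cumulative counts of 2, 3, 4, 4, 5, 5, 6. If created_at is a timestamptz, as in the question's table, the cast to DATE uses the session time zone, so rows near midnight may land on a different day than expected.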

Related Solutions

Postgresql – how to chain postgres RULEs

Next time, please include the EXPLAIN output rather than making us dig for it in your scripts. There's no guarantee my system is using the same plan as yours (although with your test data it is likely).

The rule system here is working properly. First, I want to include my own diagnostic queries (note that I did not run EXPLAIN ANALYSE, since I was only interested in which query plan was generated):

rulestest=# explain DELETE FROM user_hits WHERE day = '2013-03-16';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Delete on application_hits  (cost=0.00..3953181.85 rows=316094576 width=24)
   ->  Nested Loop  (cost=0.00..3953181.85 rows=316094576 width=24)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..128.53 rows=6352 width=22)
               ->  Nested Loop  (cost=0.00..96.78 rows=6352 width=22)
                     ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 width=10)
                           Filter: (day = '2013-03-16'::date)
                     ->  Materialize  (cost=0.00..2.49 rows=16 width=16)
                           ->  Nested Loop  (cost=0.00..2.41 rows=16 width=16)
                                 ->  Seq Scan on application_hits  (cost=0.00..1.10 rows=4 width=10)
                                       Filter: (day = '2013-03-16'::date)
                                 ->  Materialize  (cost=0.00..1.12 rows=4 width=10)
                                       ->  Seq Scan on client_hits  (cost=0.00..1.10 rows=4 width=10)
                                             Filter: (day = '2013-03-16'::date)

 Delete on client_hits  (cost=0.00..989722.41 rows=79023644 width=18)
   ->  Nested Loop  (cost=0.00..989722.41 rows=79023644 width=18)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..43.83 rows=1588 width=16)
               ->  Nested Loop  (cost=0.00..35.89 rows=1588 width=16)
                     ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 width=10)
                           Filter: (day = '2013-03-16'::date)
                     ->  Materialize  (cost=0.00..1.12 rows=4 width=10)
                           ->  Seq Scan on client_hits  (cost=0.00..1.10 rows=4 width=10)
                                 Filter: (day = '2013-03-16'::date)

 Delete on project_hits  (cost=0.00..248851.80 rows=19755911 width=12)
   ->  Nested Loop  (cost=0.00..248851.80 rows=19755911 width=12)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..16.91 rows=397 width=10)
               ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 width=10)
                     Filter: (day = '2013-03-16'::date)

 Delete on user_hits  (cost=0.00..1887.00 rows=49763 width=6)
   ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=6)
         Filter: (day = '2013-03-16'::date)
(39 rows)

rulestest=# select distinct day from application_hits;
    day     
------------
 2013-03-15
 2013-03-16
(2 rows)

rulestest=# select count(*), day from application_hits group by day;
 count |    day     
-------+------------
     4 | 2013-03-15
     4 | 2013-03-16
(2 rows)

rulestest=# select count(*), day from client_hits group by day;
 count |    day     
-------+------------
     4 | 2013-03-15
     4 | 2013-03-16
(2 rows)

rulestest=# select count(*), day from project_hits group by day;
 count |    day     
-------+------------
   397 | 2013-03-15
   397 | 2013-03-16
(2 rows)

If your real data is anything like the test data you posted, neither rules nor triggers will work very well. A better option is a stored procedure to which you pass a value and which deletes everything you want.

First let's note that indexes here will get you nowhere because in all cases you are pulling half of the tables (I did add indexes on day on all tables to help the planner but this made no real difference).

You need to start with what you are doing with RULEs. RULEs basically rewrite queries, and they do so in ways that are as robust as possible. Your code also doesn't match your example, though it matches your question better. You have rules on tables which cascade to rules on other tables, which in turn cascade to rules on yet other tables.

Therefore when you delete from user_hits where [criteria], the rules transform this into a set of queries:

DELETE FROM application_hits 
 WHERE day IN (SELECT day FROM client_hits 
               WHERE day IN (SELECT day FROM user_hits WHERE [condition]));
DELETE FROM client_hits
  WHERE day IN (SELECT day FROM user_hits WHERE [condition]);
DELETE FROM user_hits WHERE [condition];

Now, you might think we could skip the scan on client_hits in the first, but that isn't what happens here. The problem is that you could have days in user_hits and application_hits that are not in client_hits so you really have to scan all tables.

Now, there is no magic bullet here. A trigger isn't going to work much better because, while it avoids scanning every table, it is fired for every row that gets deleted, so you basically end up with the same nested-loop sequential scans that are currently killing performance. It will work a bit better because it deletes rows along the way rather than rewriting the query, but it still isn't going to perform very well.

A much better solution is to just define a stored procedure and have the application call that. Something like:

CREATE OR REPLACE FUNCTION delete_stats_at_date(in_day date) RETURNS BOOL 
LANGUAGE SQL AS
$$
DELETE FROM application_hits WHERE day = $1;
DELETE FROM project_hits WHERE day = $1;
DELETE FROM client_hits WHERE day  = $1;
DELETE FROM user_hits WHERE day = $1;
SELECT TRUE;
$$;

On the test data this runs in 280 ms on my laptop.
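Assuming the function above has been created, the application would call it like this. The date literal is just the one used in the EXPLAIN example; wrapping the call in an explicit transaction is optional and shown only so the result can be inspected before it is made permanent.

```sql
BEGIN;

-- Removes the rows for one day from all four tables in a single call.
SELECT delete_stats_at_date('2013-03-16'::DATE);

-- Optional sanity check before committing: should return 0.
SELECT COUNT(*) FROM user_hits WHERE day = '2013-03-16';

COMMIT;
```

Because each DELETE filters directly on day, no table has to be joined against another, which is what the rewritten rule queries could not avoid.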

One of the hard things regarding RULEs is remembering what they are and noting that the computer cannot, in fact, read your mind. This is why I would not consider them a beginner's tool.

Postgresql – Altering a parent table in Postgresql 8.4 breaks child table defaults

Your problem is that when you add a new column to temp_person, its child temp_person_two gets that column appended at the end of its column list (4th position), so after the has_default column. See:

db=> \d temp_person
  Column   |       Type        |                            Modifiers                            
-----------+-------------------+-----------------------------------------------------------------
 person_id | integer           | not null default nextval('temp_person_person_id_seq'::regclass)
 name      | character varying | 
 foo       | text              | 

db=> \d temp_person_two 
   Column    |         Type         |                            Modifiers                            
-------------+----------------------+-----------------------------------------------------------------
 person_id   | integer              | not null default nextval('temp_person_person_id_seq'::regclass)
 name        | character varying    | 
 has_default | character varying(4) | not null default 'en'::character varying
 foo         | text                 |

So, when you execute this:

INSERT INTO temp_person_two VALUES ( NEW.* );

PostgreSQL will actually understand that you want to insert into the first three columns of temp_person_two (as NEW.* will expand to three values), generating something similar to this:

INSERT INTO temp_person_two(person_id,name,has_default)
VALUES ( NEW.person_id, NEW.name, NEW.foo );

So, temp_person_two.has_default will get the value of NEW.foo, which is NULL in your case.

The solution is to simply expand the column names:

INSERT INTO temp_person_two(person_id,name,foo)
VALUES ( NEW.person_id, NEW.name, NEW.foo );

or, you could also use this:

INSERT INTO temp_person_two(person_id,name,foo)
VALUES ( NEW.* );

But this is weak, as any changes on column positions may break your statements, so I'd recommend the first one.
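The column-order effect can be reproduced outside of any rule with a small sketch. The table names parent_demo and child_demo are hypothetical stand-ins for temp_person and temp_person_two, and a non-NULL value is used so that the misplaced value is visible instead of raising a NOT NULL error.

```sql
CREATE TEMP TABLE parent_demo (person_id INT, name TEXT);
CREATE TEMP TABLE child_demo (
  has_default VARCHAR(4) NOT NULL DEFAULT 'en'
) INHERITS (parent_demo);

-- The new column is appended after has_default in the child.
ALTER TABLE parent_demo ADD COLUMN foo TEXT;

-- Positional insert: the third value lands in has_default, not foo.
INSERT INTO child_demo VALUES (1, 'Ann', 'bar');

-- Explicit column list: the value goes to foo and the default applies.
INSERT INTO child_demo (person_id, name, foo) VALUES (2, 'Bob', 'bar');

SELECT person_id, name, has_default, foo
FROM child_demo
ORDER BY person_id;

-- person_id  name  has_default  foo
--         1  Ann   bar          (null)
--         2  Bob   en           bar
```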

EDIT:

So the conclusion and the lesson learned here is:

Always explicitly type the names of the columns and the values when issuing an INSERT command, in fact, when issuing any SQL command at all... =D

This will save you a lot of time solving problems like that in future.
