Postgresql – How to summate weekly totals from a postgres date range type per user

aggregatedatetimepostgresql

I have the following table:

Schedules
+----+---------+------------+-------+--------------------------+
| id | user_id | project_id | hours | date_range               |
+----+---------+------------+-------+--------------------------+
|  1 |       1 |          1 |     4 | [2019-05-01, 2019-05-08) |
+----+---------+------------+-------+--------------------------+
|  2 |       2 |          1 |     8 | [2019-04-20, 2019-05-20) |
+----+---------+------------+-------+--------------------------+
|  3 |       1 |          2 |     2 | [2019-05-05, 2019-05-12) |
+----+---------+------------+-------+--------------------------+
| ...etc       |            |       |                          |
+--------------+------------+-------+--------------------------+

What I was hoping to accomplish with SQL was to get a summation of total hours per week given a specific date range input, for example date_start: 2019-05-05, date_end: 2019-05-11.

For the week of 2019-05-05 to 2019-05-11, any Schedule that has a date_range that has overlapping dates would get summated by project_id and by user_id.

Example: Row 1 from the above table has a date_range of [2019-05-01, 2019-05-08) so it would have 2019-05-05, 2019-05-06, 2019-05-07 (2019-05-08 omitted because exclusive )) dates that match the date_start/date_end criterion.

Since the hours are 4 for that row, the given total would be 4 * n days = 12 hours total

The date_start/date_end can be arbitrarily large, so say it was over a year range, every date_range would be added by week over the given range.

Potential expected output (open to suggestions):

| week_start |  week_end  | project_id | user_id | total_hours |
| 2019-05-05 | 2019-05-11 |          1 |       1 |          12 |
| 2019-05-05 | 2019-05-11 |          1 |       2 |          56 |
| etc ...    |            |            |         |             |

I'm not quite sure how to structure a query to generate matching date_rage days and multiply them out per user.

Can someone point me in the right direction?

Best Answer

Build week borders table (week_start, week_end) in the range from MIN(date_start) to MAX(date_end) in CTE. Then cross-join it with your table, select only overlapped pairs and calculate values you need. - akina

Related Solutions

Postgresql – how to chain postgres RULEs

Next time, please include the EXPLAIN output rather than making us dig for it in your scripts. There's no guarantee my system is using the same plan as yours (although with your test data it is likely).

The rule system here is working properly. First, the I want to include my own diagnostic queries (note I did not run EXPLAIN ANALYSE since I was just interested in what query plan was generated):

rulestest=# explain DELETE FROM user_hits WHERE day = '2013-03-16';
                                              QUERY PLAN                        

--------------------------------------------------------------------------------
----------------------
 Delete on application_hits  (cost=0.00..3953181.85 rows=316094576 width=24)
   ->  Nested Loop  (cost=0.00..3953181.85 rows=316094576 width=24)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..128.53 rows=6352 width=22)
               ->  Nested Loop  (cost=0.00..96.78 rows=6352 width=22)
                     ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 wi
dth=10)
                           Filter: (day = '2013-03-16'::date)
                     ->  Materialize  (cost=0.00..2.49 rows=16 width=16)
                           ->  Nested Loop  (cost=0.00..2.41 rows=16 width=16)
                                 ->  Seq Scan on application_hits  (cost=0.00..1
.10 rows=4 width=10)
                                       Filter: (day = '2013-03-16'::date)
                                 ->  Materialize  (cost=0.00..1.12 rows=4 width=
10)
                                       ->  Seq Scan on client_hits  (cost=0.00..
1.10 rows=4 width=10)
                                             Filter: (day = '2013-03-16'::date)

 Delete on client_hits  (cost=0.00..989722.41 rows=79023644 width=18)
   ->  Nested Loop  (cost=0.00..989722.41 rows=79023644 width=18)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..43.83 rows=1588 width=16)
               ->  Nested Loop  (cost=0.00..35.89 rows=1588 width=16)
                     ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 wi
dth=10)
                           Filter: (day = '2013-03-16'::date)
                     ->  Materialize  (cost=0.00..1.12 rows=4 width=10)
                           ->  Seq Scan on client_hits  (cost=0.00..1.10 rows=4 
width=10)
                                 Filter: (day = '2013-03-16'::date)

 Delete on project_hits  (cost=0.00..248851.80 rows=19755911 width=12)
   ->  Nested Loop  (cost=0.00..248851.80 rows=19755911 width=12)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..16.91 rows=397 width=10)
               ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 width=10
)
                     Filter: (day = '2013-03-16'::date)

 Delete on user_hits  (cost=0.00..1887.00 rows=49763 width=6)
   ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=6)
         Filter: (day = '2013-03-16'::date)
(39 rows)

rulestest=# select distinct day from application_hits;
    day     
------------
 2013-03-15
 2013-03-16
(2 rows)

rulestest=# select count(*), day from application_hits group by day;
 count |    day     
-------+------------
     4 | 2013-03-15
     4 | 2013-03-16
(2 rows)

rulestest=# select count(*), day from client_hits group by day;
 count |    day     
-------+------------
     4 | 2013-03-15
     4 | 2013-03-16
(2 rows)

rulestest=# select count(*), day from project_hits group by day;
 count |    day     
-------+------------
   397 | 2013-03-15
   397 | 2013-03-16
(2 rows)

If your data is anything like your existing data, neither rules nor triggers will work very well. Better will be a stored procedure which you pass a value and it deletes everything you want.

First let's note that indexes here will get you nowhere because in all cases you are pulling half of the tables (I did add indexes on day on all tables to help the planner but this made no real difference).

You need to start with what you are doing with RULEs. RULEs basically rewrite queries and they do so using ways that are as robust as possible. Your code also doesn't match your example though it matches your question better. You have rules on tables which cascade to rules on other tables which cascade to rules on other tables

Therefore when you delete from user_hits where [criteria], the rules transform this into a set of queries:

DELETE FROM application_hits 
 WHERE day IN (SELECT day FROM client_hits 
               WHERE day IN (SELECT day FROM user_hits WHERE [condition]));
DELETE FROM client_hits
  WHERE day IN (SELECT day FROM user_hits WHERE [condition]);
DELETE FROM user_hits WHERE [condition];

Now, you might think we could skip the scan on client_hits in the first, but that isn't what happens here. The problem is that you could have days in user_hits and application_hits that are not in client_hits so you really have to scan all tables.

Now here there is no magic bullet. A trigger isn't going to work much better because, while it gets to avoid scanning every table, it gets fired every row that gets deleted so you basically end up with the same nested loop sequential scans that are currently killing performance. It will work a bit better because it will delete rows along the way rather than rewriting the query along the way, but it isn't going to perform very well.

A much better solution is to just define a stored procedure and have the application call that. Something like:

CREATE OR REPLACE FUNCTION delete_stats_at_date(in_day date) RETURNS BOOL 
LANGUAGE SQL AS
$$
DELETE FROM application_hits WHERE day = $1;
DELETE FROM project_hits WHERE day = $1;
DELETE FROM client_hits WHERE day  = $1;
DELETE FROM user_hits WHERE day = $1;
SELECT TRUE;
$$;

On the test data this runs in 280 ms on my laptop.

One of the hard things regarding RULEs is remembering what they are and noting that the computer cannot, in fact, read your mind. This is why I would not consider them a beginner's tool.

Postgresql – A costly approach to filtering on a calculated date

WARNING! Depending on authors response, my answer could be way off. Awaiting his reponse.

reconbot, we need more info, because it seems like your entire subquery

SELECT DISTINCT ON (client_id) id
FROM orders
WHERE
  start_date <= ship_date and (end_date is NULL OR end_date >= ship_date)
  AND client_id = items_to_work.client_id
ORDER BY client_id, order_type DESC

is completely redundant. You appear to be re-checking for already checked conditions.

For example, try just running

SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;

and you get the exact same results. See the SQL Fiddle here. If you're still concerned about distinct client_ids, you could also use

SELECT *
FROM (
  SELECT DISTINCT ON (client_id)
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
    -- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work;

as in this SQL Fiddle.

So, I'm either misunderstanding, or your originally given data set doesn't reflect the significance of your sub-SELECT in the original question.

IGNORE EVERYTHING BELOW HERE...

Until we get clarification from the author...

I think I've got your solution. At the very least, I've made a SQLFiddle with the results, and it appears that it will be much less costly.

`LATERAL` sub-queries in PostgreSQL

I love chances when a LATERAL sub-query can be used to save some time in your query. Unfortunately, I think I'm pretty bad at explaining when and where it should be used, and I'm only OK at recognizing instances of when to use it. It just doesn't come up too often in my particular query designs.

Take a look at the Postgres documentation on LATERAL keyword for some ideas, and also I really like this SlideShare presentation by Markus Winand for helping to explain LATERAL a bit better. In essence, it has a flavor of a "for each" statement in typical pseudo-coding vernacular.

The reason I looked into it for your case was: you were building the items_to_work table, and then using the client_id attribute of items_to_work in your sub-SELECT, where you checked if items_to_work.order_id was IN the distinct returned set of orders.id values. Using items_to_work attributes in the WHERE clause of the sub-SELECT was the red flag for me.

OK, so I realize the explanation of my motivation isn't so hot, sorry! :P On to the results...

New Query

Without further ado, here it is:

SELECT *
FROM (
  SELECT
    order_items.id item_id,
    order_items.product_id,
    order_items.quantity,
    order_items.lead_days,
    orders.id order_id,
    orders.client_id client_id,
    (DATE '2014-07-18' + order_items.lead_days) ship_date
  FROM order_items
  INNER JOIN orders ON orders.id = order_items.order_id
  WHERE
-- find all order items that have active orders on the production_date + lead time
    DATE '2014-07-18' >= (orders.start_date - order_items.lead_days)
    AND (
      DATE '2014-07-18' <= (orders.end_date - order_items.lead_days)
      OR orders.end_date IS NULL
    )
) items_to_work,
LATERAL(
  SELECT id FROM orders
  WHERE orders.client_id = items_to_work.client_id
  AND start_date <= ship_date and (end_date is NULL OR end_date >= ship_date)
  AND items_to_work.order_id = orders.id) lat;

Also, you can note that we get back identical results as those you are expecting, except for a single added column on the right side, as part of the LATERAL selection, which is a replication of the orders.id. I'm not super experienced with LATERAL (it doesn't come up too often in my work), so if that's a problem for you, we can sort out a way to drop it. :P

`EXPLAIN` results

So, we don't have the big data set which you have, so that we can really test out the results of the new query versus the old. I'm relying on the EXPLAIN estimates here, but...

Using the Old SQLFiddle, we can see that the overall estimated cost is approximately 283,000 Postgres units. :P

Using the New SQLFiddle, we can see the much better estimated cost of only 204!!