PostgreSQL – Get Rows That Exist Exactly Once Per Day for a Given Period

postgresqlrelational-division

I have a case where I want to extract the device ids (DID) that exist once and only once for each day in a certain period. I have tried different methods and partitions but I seem to only be able to get that data individually per day (where date = X), but I need a query with where date between X and Y.

Example data:

DID date     
A   2019-01-01
A   2019-01-01
A   2019-01-02
A   2019-01-03
B   2019-01-01
B   2019-01-02
B   2019-01-03
C   2019-01-01
C   2019-01-02
C   2019-01-02
C   2019-01-03
D   2019-01-01
D   2019-01-02
D   2019-01-03

The query should return only B & D because those exists once for each day from 01 to 03.
I also wish to get the count, which would be 2 in this case

Best Answer

This is special case of relational-division. Predicate dates just happen to be in sequence and duplicates are excluded. There are many possible solutions. The best fit depends on the complete use case.

If, for example, you need to optimize performance and your table is big and the column did has high cardinality and typically only few did qualify and there is an index on (dt, did), then this query should be substantially faster than queries with GROUP BY & count():

SELECT did
FROM   tbl t1
JOIN   tbl t2  USING (did)
JOIN   tbl t3  USING (did)
WHERE  t1.dt = '2019-01-01'
AND    t2.dt = '2019-01-02'
AND    t3.dt = '2019-01-03'
AND    NOT EXISTS (SELECT FROM tbl t1x WHERE t1x.dt = '2019-01-01' AND t1x.did = t1.did AND t1x.ctid <> t1.ctid)
AND    NOT EXISTS (SELECT FROM tbl t2x WHERE t2x.dt = '2019-01-02' AND t2x.did = t2.did AND t2x.ctid <> t2.ctid)
AND    NOT EXISTS (SELECT FROM tbl t3x WHERE t3x.dt = '2019-01-03' AND t3x.did = t3.did AND t3x.ctid <> t3.ctid);

Why? Because it can answer your query with a few comparatively very cheap index (only) scans, excluding non-qualifying rows early, while queries based on GROUP BY always have to process the whole table.

ctid in my example is safe fallback to identify individual columns, while we don't know your actual design.

This gets verbose for long date ranges. It may pay to generate the query dynamically. Or a mixed approach might make sense. Or use a recursive CTE to make it short and still very fast:

WITH RECURSIVE rcte AS (
   SELECT did, dt
   FROM   tbl t
   WHERE  dt = '2019-01-01'    -- lower bound
   AND    NOT EXISTS (SELECT FROM tbl x WHERE x.dt = t.dt AND x.did = t.did AND x.ctid <> t.ctid)

   UNION ALL
   SELECT t.did, t.dt
   FROM   rcte r
   JOIN   tbl  t USING (did)
   WHERE  r.dt < '2019-01-03'  -- upper bound
   AND    t.dt = r.dt + 1
   AND    NOT EXISTS (SELECT FROM tbl x WHERE x.dt = t.dt AND x.did = t.did AND x.ctid <> t.ctid)
   )
SELECT array_agg(did) AS dids, count(*)
FROM   rcte
WHERE  dt = '2019-01-03'       -- found 1 row from lower to upper bound

This variant returns an array of qualifying IDs and a count.

db<>fiddle here

Related: (recommended reading!)

SQL query to find a row with a specific number of associations

It all depends on the actual, complete use case.

Also related:

SELECT rows based on indefinite number of filters

The query you have

You could simplify your query using a WINDOW clause, but that's just shortening the syntax, not changing the query plan.

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
             ORDER BY trans_date
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

Also using the slightly faster count(*), since id is certainly defined NOT NULL?
And you don't need to ORDER BY entity_id since you already PARTITION BY entity_id

You can simplify further, though:
Don't add ORDER BY to the window definition at all, it's not relevant to your query. Then you don't need to define a custom window frame, either:

SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date);

Simpler, faster, but still just a better version of what you have, with static months.

The query you might want

... is not clearly defined, so I'll build on these assumptions:

Count transactions and amount for every 30-day period within the first and last transaction of any entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.

SELECT entity_id, trans_date
     , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
     , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
FROM  (
   SELECT entity_id
        , generate_series (min(trans_date)::timestamp
                         , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                         , interval '1 day')::date AS trans_date
   FROM   transactiondb 
   GROUP  BY 1
   ) x
LEFT JOIN (
   SELECT entity_id, trans_date
        , sum(amount) AS daily_amount, count(*) AS daily_count
   FROM   transactiondb
   GROUP  BY 1, 2
   ) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);

This lists all 30-day periods for each entity_id with your aggregates and with trans_date being the first day (incl.) of the period. To get values for each individual row join to the base table once more ...

The basic difficulty is the same as discussed here:

Referencing current row in FILTER clause of window function

The frame definition of a window cannot depend on values of the current row.

And rather call generate_series() with timestamp input:

Generating time series between two dates in PostgreSQL

The query you actually want

After question update and discussion:
Accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.

Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins, yet:

SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30  -- exclude upper bound
-- AND    t0.entity_id = 114284  -- or pick a single entity ...
GROUP  BY t0.id  -- is PK!
ORDER  BY t0.trans_date, t0.id

SQL Fiddle.

A rolling window could only make sense (with respect to performance) with data for most days.

This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.

For a big table, a covering index like this could help quite a bit:

CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);

The last column amount is only useful if you get index-only scans out of it. Else drop it.

But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.

How to Improve Union for Consistent Results in PostgreSQL

Your query can be simplified in various respects:

SELECT to_char(day, 'DD/MM/YYYY') AS date
     , hashtag
     , count(d.*)::int AS count
FROM  (
       campaign c
CROSS  JOIN json_array_elements_text(c.configuration#>'{query,hashtags}') ch(hashtag)
CROSS  JOIN (SELECT g::date AS day
             FROM generate_series(timestamp '2016-01-20', '2016-01-26', interval '1 day') g) day 
      )
NATURAL LEFT JOIN (
   SELECT "createdAt"::date AS day, dh.hashtag
   FROM   data, json_array_elements_text(meta#>'{matchedOn,hashtags}') dh(hashtag)
   WHERE  "campaignId" = 1
   AND    "createdAt" >= '2016-01-20'
   AND    "createdAt" <  '2016-01-27'
   ) d
WHERE  c.id = 1
GROUP  BY day, hashtag
ORDER  BY day, hashtag, count;

This should be faster for multiple reasons. Not least it can use a multi-column index on data("campaignId", "createdAt") - which you should create unless you have it.

LEFT JOIN is the core feature you need. Then count the column, only non-null values count ...

One of the very rare cases where a NATURAL JOIN is useful, btw. You don't need it, though. It's just a minor syntax shortcut.

And rather use timestamp input for generate_series():

Generating time series between two dates in PostgreSQL

Best Answer

Related Solutions

PostgreSQL – How to Calculate Rolling Sum, Count, and Average Over Date Interval

The query you have

The query you might want

The query you actually want

How to Improve Union for Consistent Results in PostgreSQL

Related Question