PostgreSQL – Get Rows That Exist Exactly Once Per Day for a Given Period

postgresql, relational-division

I have a case where I want to extract the device IDs (DID) that exist once and only once for each day in a certain period. I have tried different methods and partitions, but I can only get that data for one day at a time (WHERE date = X). I need a query that works with WHERE date BETWEEN X AND Y.

Example data:

DID date     
A   2019-01-01
A   2019-01-01
A   2019-01-02
A   2019-01-03
B   2019-01-01
B   2019-01-02
B   2019-01-03
C   2019-01-01
C   2019-01-02
C   2019-01-02
C   2019-01-03
D   2019-01-01
D   2019-01-02
D   2019-01-03

The query should return only B & D because those exist exactly once for each day from 01 to 03.
I also want to get the count, which would be 2 in this case.

Best Answer

This is a special case of relational division. The predicate dates just happen to be in sequence, and duplicates are excluded. There are many possible solutions. The best fit depends on the complete use case.

If, for example, you need to optimize performance, your table is big, the column did has high cardinality, typically only a few did qualify, and there is an index on (dt, did), then this query should be substantially faster than queries with GROUP BY & count():

SELECT did
FROM   tbl t1
JOIN   tbl t2  USING (did)
JOIN   tbl t3  USING (did)
WHERE  t1.dt = '2019-01-01'
AND    t2.dt = '2019-01-02'
AND    t3.dt = '2019-01-03'
AND    NOT EXISTS (SELECT FROM tbl t1x WHERE t1x.dt = '2019-01-01' AND t1x.did = t1.did AND t1x.ctid <> t1.ctid)
AND    NOT EXISTS (SELECT FROM tbl t2x WHERE t2x.dt = '2019-01-02' AND t2x.did = t2.did AND t2x.ctid <> t2.ctid)
AND    NOT EXISTS (SELECT FROM tbl t3x WHERE t3x.dt = '2019-01-03' AND t3x.did = t3.did AND t3x.ctid <> t3.ctid);

Why? Because it can answer your query with a few comparatively very cheap index (only) scans, excluding non-qualifying rows early, while queries based on GROUP BY always have to process the whole table.
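For comparison, here is a minimal sketch of the kind of GROUP BY & count() query referred to above, together with a test setup for the sample data from the question. The column types (text, date) and the temporary table are assumptions, since the actual table definition is not shown:

```sql
-- Assumed schema: the question does not show the real table definition.
CREATE TEMP TABLE tbl (did text, dt date);

INSERT INTO tbl (did, dt) VALUES
  ('A', '2019-01-01'), ('A', '2019-01-01'), ('A', '2019-01-02'), ('A', '2019-01-03')
, ('B', '2019-01-01'), ('B', '2019-01-02'), ('B', '2019-01-03')
, ('C', '2019-01-01'), ('C', '2019-01-02'), ('C', '2019-01-02'), ('C', '2019-01-03')
, ('D', '2019-01-01'), ('D', '2019-01-02'), ('D', '2019-01-03');

CREATE INDEX ON tbl (dt, did);     -- the index the join-based query relies on

-- Baseline: aggregate every row in the range, then filter per did.
SELECT did
FROM   tbl
WHERE  dt BETWEEN '2019-01-01' AND '2019-01-03'
GROUP  BY did
HAVING count(*) = 3                -- as many rows as days in the range ...
AND    count(DISTINCT dt) = 3      -- ... and all of them on different days
ORDER  BY did;
```

With the sample data this returns B and D. The constant 3 is the number of days in the range. For other ranges it could be computed as the upper bound minus the lower bound plus 1, provided both bounds are of type date.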

ctid in my example is a safe fallback to identify individual rows, since we don't know your actual table design. If the table has a primary key, compare that instead.
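A sketch of the same duplicate check with a primary key in place of ctid. The table tbl_pk and the column tbl_id are hypothetical names chosen for illustration, and the identity column syntax assumes PostgreSQL 10 or later:

```sql
-- Hypothetical table with a surrogate key; tbl_pk and tbl_id are assumed names.
CREATE TEMP TABLE tbl_pk (
   tbl_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
 , did    text NOT NULL
 , dt     date NOT NULL
);

INSERT INTO tbl_pk (did, dt) VALUES
  ('A', '2019-01-01'), ('A', '2019-01-01'), ('B', '2019-01-01');

-- Rows for the given day that have no other row with the same (dt, did).
SELECT t1.did
FROM   tbl_pk t1
WHERE  t1.dt = '2019-01-01'
AND    NOT EXISTS (
   SELECT FROM tbl_pk t1x
   WHERE  t1x.dt     = t1.dt
   AND    t1x.did    = t1.did
   AND    t1x.tbl_id <> t1.tbl_id   -- replaces the ctid comparison
   );
```

With these three rows only B is returned, because A appears twice on that day.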

This gets verbose for long date ranges. It may pay to generate the query dynamically. Or a mixed approach might make sense. Or use a recursive CTE to make it short and still very fast:

WITH RECURSIVE rcte AS (
   SELECT did, dt
   FROM   tbl t
   WHERE  dt = '2019-01-01'    -- lower bound
   AND    NOT EXISTS (SELECT FROM tbl x WHERE x.dt = t.dt AND x.did = t.did AND x.ctid <> t.ctid)

   UNION ALL
   SELECT t.did, t.dt
   FROM   rcte r
   JOIN   tbl  t USING (did)
   WHERE  r.dt < '2019-01-03'  -- upper bound
   AND    t.dt = r.dt + 1
   AND    NOT EXISTS (SELECT FROM tbl x WHERE x.dt = t.dt AND x.did = t.did AND x.ctid <> t.ctid)
   )
SELECT array_agg(did) AS dids, count(*)
FROM   rcte
WHERE  dt = '2019-01-03';      -- reached the upper bound: exactly 1 row per day from lower to upper bound

This variant returns an array of qualifying IDs and a count.
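For the dynamic route mentioned above, one possible sketch builds the text of the multi-join query with format() and generate_series(). It assumes the names used in this answer (tbl, did, dt) and that dt is of type date. The aliases t1, t2, ... and x1, x2, ... are generated for illustration:

```sql
-- Generates the query text for an arbitrary date range; does not execute it.
SELECT format(
          'SELECT t1.did FROM tbl t1 %s WHERE %s;'
          -- one JOIN per day after the first
        , string_agg(format('JOIN tbl t%s USING (did)', n), ' ' ORDER BY n)
             FILTER (WHERE n > 1)
          -- one date predicate plus duplicate check per day
        , string_agg(
             format('t%1$s.dt = %2$L AND NOT EXISTS (SELECT FROM tbl x%1$s WHERE x%1$s.dt = t%1$s.dt AND x%1$s.did = t%1$s.did AND x%1$s.ctid <> t%1$s.ctid)', n, d)
           , ' AND ' ORDER BY n)
       ) AS query_text
FROM  (
   SELECT row_number() OVER (ORDER BY d) AS n, d::date AS d
   FROM   generate_series(timestamp '2019-01-01'   -- lower bound
                        , timestamp '2019-01-03'   -- upper bound
                        , interval '1 day') d
   ) sub;
```

The result is a single row holding the statement text. In psql, ending the statement with \gexec instead of the semicolon should execute the generated text directly. Otherwise, run it from the client or with EXECUTE in a PL/pgSQL function.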



It all depends on the actual, complete use case.
