PostgreSQL Group By – Group by Time Interval and Output Source and Destination Station_id and Count

Tags: gaps-and-islands, postgresql, postgresql-10, window functions

I am stuck with a query:

CREATE TABLE public.bulk_sample (
    serial_number character varying(255),
    validation_date timestamp,  -- timestamp of entry and exit
    station_id integer,
    direction integer           -- 1 = Entry | 2 = Exit
);

INSERT INTO public.bulk_sample VALUES
  ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:31:58', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:50:22', 113, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:16:56', 113, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:47:06', 120, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:02:12', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:47:45', 102, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 19:26:38', 102, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 20:17:24', 120, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 07:58:20', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 08:43:35', 104, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 16:38:10', 104, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:15:01', 119, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:42:29', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:48:05', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:17:59', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:25:25', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:16:12', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:32:51', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:31:20', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:39:33', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 20:57:50', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 21:16:25', 120, 2)
;

I need a query that returns a result like the following:

source | dest | Count
120    | 113  |  1
113    | 120  |  1

I tried the following query but was not able to get the desired result:

SELECT serial_number
     , count(*)
     , min(validation_date) AS start_time
     , CASE WHEN count(*) > 1 THEN max(validation_date) END AS end_time
FROM  (
   SELECT serial_number, validation_date
        , count(step OR NULL) OVER (ORDER BY serial_number, validation_date) AS grp
   FROM  (
      SELECT *
           , lag(validation_date) OVER (PARTITION BY serial_number ORDER BY validation_date)
           < validation_date - interval '60 min' AS step
      FROM   table1
      WHERE  validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
      ) sub1
   ) sub2
GROUP  BY serial_number, grp;

The interval between each entry and its exit is about 55 to 60 minutes.

I also tried an inner join, but was not able to group by the time interval in it:

SELECT source.station_id AS source_station ,dest.station_id AS destination_station ,source.count FROM 
    (
        SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1
        WHERE 
            direction = 1 AND 
            validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59' 
        GROUP BY serial_number,station_id
    )source

 INNER JOIN 
    (
        SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1
        WHERE 
            direction = 2 AND 
            validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
        GROUP BY serial_number,station_id
    )dest
ON source.serial_number = dest.serial_number and source.station_id <> dest.station_id

The challenge is that sometimes the entry date is NULL and sometimes the exit date is NULL.

Best Answer

This should be simplest and fastest while transactions per serial_number never overlap:

WITH cte AS (
   SELECT serial_number, validation_date, station_id, direction
        , row_number() OVER (PARTITION BY serial_number ORDER BY validation_date) AS rn
   FROM   bulk_sample
   WHERE  validation_date >= '2020-02-01'  -- ①
   AND    validation_date <  '2020-02-02'  -- entry & exit must be within time frame
   )
SELECT s.station_id AS source, d.station_id AS dest, count(*)
FROM   cte s
JOIN   cte d USING (serial_number)
WHERE  s.direction = 1
AND    d.rn = s.rn + 1
GROUP  BY 1, 2
ORDER  BY 1, 2;  -- optional sort order

① I rewrote the WHERE condition to get all of Feb 1 2020 in optimal fashion. BETWEEN is almost always the wrong tool for time ranges. See:

How to add a day/night indicator to a timestamp column?

Also, '2020-02-01' is a perfectly valid timestamp constant; 00:00:00 is assumed when the time component is missing.
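
To illustrate both points, here is a minimal sketch that needs no table. The sample value 23:59:59.5 is an assumption chosen for the demonstration; it stands for any timestamp with fractional seconds in the last second of the day.

```sql
SELECT timestamp '2020-02-01' = timestamp '2020-02-01 00:00:00'  AS midnight_assumed     -- true
     , (timestamp '2020-02-01 23:59:59.5'
        BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59') AS caught_by_between    -- false
     , (timestamp '2020-02-01 23:59:59.5' >= '2020-02-01'
    AND timestamp '2020-02-01 23:59:59.5' <  '2020-02-02')       AS caught_by_half_open; -- true
```

The half-open range (lower bound included, upper bound excluded) covers the whole day regardless of timestamp precision.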

While retrieving results for a given time frame, a plain btree index on (validation_date) is the optimum. For the complete table, an index on (serial_number, validation_date) would help more.
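
The two index variants mentioned above could be created like this. The index names are made up for illustration; which one pays off depends on your actual query mix.

```sql
-- For queries restricted to a time frame (hypothetical index name):
CREATE INDEX bulk_sample_validation_date_idx
    ON bulk_sample (validation_date);

-- For queries over the complete table, pairing rows per serial_number
-- (hypothetical index name):
CREATE INDEX bulk_sample_serial_date_idx
    ON bulk_sample (serial_number, validation_date);
```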

`validation_date IS NULL`?

The query keeps working while only the last destination per serial_number in the given time frame has validation_date IS NULL because NULL values happen to sort last in default ascending order. But it breaks with any other cases of validation_date IS NULL. You'll have to define more closely where those can pop up and how to deal with them exactly.
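
As one possible starting point, here is a more defensive variant. This is a sketch, not part of the query above: it uses lead() instead of the self-join and additionally requires that the row following an entry is an exit, so an entry without a recorded exit is skipped rather than paired with the next entry. Rows with validation_date IS NULL do not pass the range filter and are ignored. Whether skipping is the right treatment is an assumption you would have to confirm against your requirements.

```sql
SELECT source, dest, count(*)
FROM  (
   SELECT station_id AS source
        , direction
        , lead(station_id) OVER w AS dest            -- station of the next row
        , lead(direction)  OVER w AS next_direction  -- direction of the next row
   FROM   bulk_sample
   WHERE  validation_date >= '2020-02-01'
   AND    validation_date <  '2020-02-02'
   WINDOW w AS (PARTITION BY serial_number ORDER BY validation_date)
   ) sub
WHERE  direction = 1       -- start from an entry ...
AND    next_direction = 2  -- ... that is directly followed by an exit
GROUP  BY 1, 2
ORDER  BY 1, 2;
```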

(2x) `uuid` instead of `varchar(255)` for `serial_number`?

Your serial_number seems to be a hexadecimal number with exactly 64 digits. If so, varchar(255) is a poor choice. See:

Should I add an arbitrary length limit to VARCHAR columns?

Moreover, a single uuid (32 hex digits) should suffice. If all 64 hex digits are needed, still consider 2 uuid columns. Smaller, faster, safer. Consider:

SELECT *
     , replace(uuid1::text || uuid2::text, '-', '') AS reverse_engineered
     , replace(uuid1::text || uuid2::text, '-', '') = serial_number AS identical
     , pg_column_size(serial_number) AS varchar_size
     , pg_column_size(uuid1) + pg_column_size(uuid2) AS uuid_size
FROM  (
   SELECT serial_number
        , left(serial_number, 32)::uuid  AS uuid1
        , right(serial_number, 32)::uuid AS uuid2
   FROM   bulk_sample
   ) sub;

Related Solutions

Postgres – Window Function Rank and Count

Your issue appears to be that you are applying the same WINDOW (named w) for both your COUNT(*) and your rank().

When you use a WINDOW that contains an ORDER BY clause and then apply an aggregate such as SUM or COUNT, the default window frame runs from the start of the partition to the current row (including its peers). The aggregate therefore becomes a running aggregate over the ordering, which is why your COUNT and rank() are identical.
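
The effect can be reproduced without any tables. In this minimal sketch, the inline values and column names are made up for illustration:

```sql
SELECT v
     , count(*) OVER (ORDER BY v) AS running_count  -- 1, 2, 3: counts rows up to the current one
     , count(*) OVER ()           AS total_count    -- 3, 3, 3: no ORDER BY, whole partition
     , rank()   OVER (ORDER BY v) AS rnk            -- 1, 2, 3: same as running_count here
FROM  (VALUES (10), (20), (30)) AS t(v)
ORDER  BY v;
```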

If you modify your query to use separate windows, as in

SELECT compUserId, rank, totalUsers 
FROM (
    SELECT cu.competition_user_id as compUserId, cu.user_id as userId,  
    count(*) OVER (PARTITION BY cu.competition_id) as totalUsers, 
    rank() OVER (PARTITION BY cu.competition_id ORDER BY cus.time_in_seconds ASC) as rank 
    FROM competition_users cu 
    LEFT JOIN current_competition_sessions ccs ON cu.competition_user_id = ccs.competition_user_id 
    LEFT JOIN competition_user_sessions cus ON cus.competition_user_session_id = ccs.competition_user_session_id 
    WHERE cu.left_competition = false 
    AND cu.competition_id in (:compIds)
) as sub 
WHERE compUserId in (:compUserIds);

so that you are only applying the PARTITION BY to your COUNT(*) window, and have both PARTITION BY and ORDER BY clauses for your rank(), I believe you'll get the results you want.

Refer to this SQL FIDDLE as a reference, where I have a generic id field, a com_num field to represent the competition id, and a com_time field to represent a competitor's time.

Postgresql – Extract start and end per group of rows, where only the end can be identified

You can use count() as window function to identify groups like this:

SELECT vehicle, min(date_from) AS starts, max(date_to) AS stops
FROM  (
   SELECT vehicle, date_from, date_to
        , count(description LIKE 'RET%' OR NULL)
             OVER (PARTITION BY vehicle ORDER BY date_from DESC) AS grp
    FROM  tbl
   ) sub
GROUP  BY vehicle, grp;

Count in descending order, since the end of each group is significant. Then just extract minimum start and maximum end per grp in the outer SELECT.
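
To see how the grouping number comes about, here is a small self-contained sketch. The inline rows and the column name d are made up for illustration. The expression description LIKE 'RET%' OR NULL yields true for matching rows and NULL otherwise, and count() only counts non-null values:

```sql
SELECT d, description
     , count(description LIKE 'RET%' OR NULL)
          OVER (ORDER BY d DESC) AS grp
FROM  (VALUES
         (1, 'OUT')
       , (2, 'RET home')
       , (3, 'OUT')
       , (4, 'OUT')
       , (5, 'RET home')
      ) AS t(d, description)
ORDER  BY d;
-- grp is 2 for d = 1 and 2, and 1 for d = 3, 4 and 5:
-- each 'RET%' row closes a group when counting from the end.
```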

Assuming all timestamp columns are defined NOT NULL.

Since timestamps are in ascending order (due to the logic of the problem), we don't need the id column at all for this.

This works for any number of vehicles, not just the one you demonstrate in your example.
