I am stuck with a query:
CREATE TABLE public.bulk_sample (
serial_number character varying(255),
validation_date timestamp, -- timestamp of entry and exit
station_id integer,
direction integer -- 1 = Entry | 2 = Exit
);
INSERT INTO public.bulk_sample VALUES
('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:31:58', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 08:50:22', 113, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:16:56', 113, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 10:47:06', 120, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:02:12', 120, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 16:47:45', 102, 2)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 19:26:38', 102, 1)
, ('019b5526970fcfcf7813e9fe1acf8a41bcaf5a5a5c10870b3211d82f63fbf270', '2020-02-01 20:17:24', 120, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 07:58:20', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 08:43:35', 104, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 16:38:10', 104, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:15:01', 119, 2)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:42:29', 119, 1)
, ('23cc9678e8cf834decb096ba36be0efee418402bce03aab52e69026adfec7663', '2020-02-01 17:48:05', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:17:59', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 15:25:25', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:16:12', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 16:32:51', 120, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:31:20', 120, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 19:39:33', 118, 2)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 20:57:50', 118, 1)
, ('2a8f28bf0afc655210aa337aff016d33100282ac73cca660a397b924808499af', '2020-02-01 21:16:25', 120, 2)
;
I have to create a query which gives a result as follows
source | dest | Count
120 | 113 | 1
113 | 120 | 1
I tried the following code but not able to get the desired result:
SELECT serial_number
, count(*)
, min(validation_date) AS start_time
, CASE WHEN count(*) > 1 THEN max(validation_date) END AS end_time
FROM (
SELECT serial_number, validation_date, count(step OR NULL) OVER (ORDER BY serial_number,
validation_date) AS grp
FROM (
SELECT *
, lag(validation_date) OVER (PARTITION BY serial_number ORDER BY validation_date)
< validation_date - interval '60 min' AS step
FROM table1
where BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
) sub1
) sub2
GROUP BY serial_number, grp;
The time interval is about 55 mins to 60 mins between every entry and exit.
I have also tried an inner join but not able to group by the time interval in an inner join
SELECT source.station_id AS source_station ,dest.station_id AS destination_station ,source.count FROM
(
SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1
WHERE
direction = 1 AND
validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
GROUP BY serial_number,station_id
)source
INNER JOIN
(
SELECT serial_number,station_id,count(bulk_transaction_id) FROM table1
WHERE
direction = 2 AND
validation_date BETWEEN '2020-02-01 00:00:00' AND '2020-02-01 23:59:59'
GROUP BY serial_number,station_id
)dest
ON source.serial_number = dest.serial_number and source.station_id <> dest.station_id
The challenge is sometimes there is null in entry date and sometimes there is null in exit date.
Best Answer
This should be simplest and fastest while transactions per
serial_number
never overlap:db<>fiddle here
① I rewrote the
WHERE
condition to get all of Feb 1 2020 in optimal fashion.BETWEEN
is almost always the wrong tool for time ranges. See:Also, '2020-02-01' is a perfectly valid
timestamp
constant00:00:00
is assumed when the time component is missing.While retrieving results for a given time frame, a plain btree index on
(validation_date)
is the optimum. For the complete table, an index on(serial_number, validation_date)
would help more.validation_date IS NULL
?The query keeps working while only the last destination per
serial_number
in the given time frame hasvalidation_date IS NULL
becauseNULL
values happen to sort last in default ascending order. But it breaks with any other cases ofvalidation_date IS NULL
. You'll have to define more closely where those can pop up and how to deal with them exactly.(2x)
uuid
instead ofvarchar(255)
forserial_number
?Your
serial_number
seems to be a hexadecimal number with exactly 64 digits. If so,varchar(255)
is a poor choice. See:Moreover, a single
uuid
(32 hex digits) should suffice. If all 64 hex digits are needed, still consider 2uuid
columns. Smaller, faster, safer. Consider:db<>fiddle here
See: