I query the YouTube Data API for a list of the most popular videos on a channel and then fetch their statistics, 4 times per hour (every 15 minutes, via cron). The data is stored in Postgres, but dumping it and loading it into another SQL DB wouldn't be a problem.
Now I have the following table of data:
video_id | views_count | likes_count | timestamp
---------+-------------+-------------+---------------------
foo      |         100 |           1 | 2018-12-01 12:01:03
foo      |         101 |           1 | 2018-12-01 12:16:06
foo      |         105 |           1 | 2018-12-01 12:31:01
bar      |         199 |           0 | 2018-12-01 12:01:02
bar      |         200 |           0 | 2018-12-01 12:16:08
bar      |         301 |           5 | 2018-12-01 12:31:02
... | ...
UPD: Here's the schema (pasted to sqlfiddle):
CREATE TABLE video_statistics
(
    video_id    TEXT        NOT NULL,
    views_count INTEGER     NOT NULL,
    likes_count INTEGER     NOT NULL,
    "timestamp" TIMESTAMPTZ NOT NULL
);
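For reference, the sample rows shown above can be loaded into that schema like so (values copied from the question's table; the column name timestamp is quoted since it collides with a type keyword):

```sql
-- Load the sample data from the question into video_statistics.
INSERT INTO video_statistics (video_id, views_count, likes_count, "timestamp")
VALUES ('foo', 100, 1, '2018-12-01 12:01:03'),
       ('foo', 101, 1, '2018-12-01 12:16:06'),
       ('foo', 105, 1, '2018-12-01 12:31:01'),
       ('bar', 199, 0, '2018-12-01 12:01:02'),
       ('bar', 200, 0, '2018-12-01 12:16:08'),
       ('bar', 301, 5, '2018-12-01 12:31:02');
```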
How should I query that data to get per-hour increments in the views_count and likes_count columns, grouped by video?
To clarify what I want to get:
hour_of_day|video_id|views_increment|likes_increment
-----------+--------+---------------+---------------
... | ...
11 | foo | 4 | 0
12 | foo | 5 | 1
... | ...
11 | bar | 73 | 0
12 | bar | 102 | 5
... | ...
In other words, I'm after a "best time to post a video" metric based on historical data, taking many weeks and months of observations into account.
Should I rather dump the data into a time-series DB, or some other database more appropriate for such cases, and query it there? Or should I just resort to calculating this in code?
Best Answer
One possibility is to first row_number() the records to get the first and last value per video, day, and hour. Then join the two sets of first and last values to get the respective differences. Finally, group the result by video and hour and take the sum or the average per video per day.
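A sketch of that approach against the question's schema (untested against real data; bucketing with date_trunc and averaging the per-hour increments over all observed days are my reading of the steps above):

```sql
-- 1. Rank each video's samples within every (video, day+hour) bucket,
--    ascending and descending, to find the first and last reading.
WITH ranked AS (
    SELECT video_id,
           views_count,
           likes_count,
           date_trunc('hour', "timestamp") AS hour_bucket,
           row_number() OVER (PARTITION BY video_id, date_trunc('hour', "timestamp")
                              ORDER BY "timestamp" ASC)  AS rn_first,
           row_number() OVER (PARTITION BY video_id, date_trunc('hour', "timestamp")
                              ORDER BY "timestamp" DESC) AS rn_last
    FROM video_statistics
),
-- 2. Join first and last readings of each bucket to get the increments.
diffs AS (
    SELECT f.video_id,
           f.hour_bucket,
           l.views_count - f.views_count AS views_increment,
           l.likes_count - f.likes_count AS likes_increment
    FROM ranked f
    JOIN ranked l
      ON l.video_id    = f.video_id
     AND l.hour_bucket = f.hour_bucket
     AND l.rn_last     = 1
    WHERE f.rn_first = 1
)
-- 3. Average the increments per video and hour of day, across all days.
SELECT extract(hour FROM hour_bucket)::int AS hour_of_day,
       video_id,
       avg(views_increment) AS views_increment,
       avg(likes_increment) AS likes_increment
FROM diffs
GROUP BY hour_of_day, video_id
ORDER BY video_id, hour_of_day;
```

Note one limitation of this sketch: it only measures growth between the first and last sample inside each hour, so the ~15-minute gap between one hour's last sample and the next hour's first sample is not attributed to either bucket.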