How to obtain the most recent row per type and perform calculations, depending on the row type

google-bigquery optimization

I posted the same question on SO, but thought I might give it a go here as well, since any further optimization advice is greatly appreciated 🙂 Anyway, here is my post, word for word:

I need some help writing/optimizing a query that retrieves the latest version of each row by type and performs some calculations depending on the type. I think it would be best if I illustrate it with an example.

Given the following dataset:

+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| id    | event_type        | event_timestamp     | message_id  | sent_at             | status | rate     |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| 1     | create            | 2016-11-25 09:17:48 | 1           | 2016-11-25 09:17:48 | 0      | 0.500000 |
| 2     | status_update     | 2016-11-25 09:24:38 | 1           | 2016-11-25 09:28:49 | 1      | 0.500000 |
| 3     | create            | 2016-11-25 09:47:48 | 2           | 2016-11-25 09:47:48 | 0      | 0.500000 |
| 4     | status_update     | 2016-11-25 09:54:38 | 2           | 2016-11-25 09:48:49 | 1      | 0.500000 |
| 5     | rate_update       | 2016-11-25 09:55:07 | 2           | 2016-11-25 09:50:07 | 0      | 1.000000 |
| 6     | create            | 2016-11-26 09:17:48 | 3           | 2016-11-26 09:17:48 | 0      | 0.500000 |
| 7     | create            | 2016-11-27 09:17:48 | 4           | 2016-11-27 09:17:48 | 0      | 0.500000 |
| 8     | rate_update       | 2016-11-27 09:55:07 | 4           | 2016-11-27 09:50:07 | 0      | 2.000000 |
| 9     | rate_update       | 2016-11-27 09:55:07 | 2           | 2016-11-25 09:55:07 | 0      | 2.000000 |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+

The expected result should be:

+------------+--------------------+--------------------+-----------------------+
| sent_at    | sum(submitted_msg) | sum(delivered_msg) | sum(rate_total)       |
+------------+--------------------+--------------------+-----------------------+
| 2016-11-25 |                  2 |                  2 |              2.500000 |
| 2016-11-26 |                  1 |                  0 |              0.500000 |
| 2016-11-27 |                  1 |                  0 |              2.000000 |
+------------+--------------------+--------------------+-----------------------+

(To take 2016-11-25 as an example: messages 1 and 2 were created that day, so submitted = 2; the latest status_update of each has status = 1, so delivered = 2; and the rate total is 0.5 for message 1 plus 2.0 from message 2's latest rate event, i.e. 2.5.)

At the end of the post is the query used to obtain this result. I'm willing to bet there is a way to optimize it, since it uses subqueries with joins, and from what I have read about BigQuery, joins are best avoided. But first, some background:

In essence, the dataset represents an append-only table to which multiple events are written. The data is in the hundreds of millions of rows and will grow to billions and beyond. Since UPDATEs are not practical in BigQuery, and the data is being streamed in, I need a way to retrieve the most recent version of each event, perform some calculations based on certain conditions, and return an accurate result. The query is generated dynamically based on user input, so more fields/calculations can be included, but they have been omitted for simplicity. (The sketch after the list below shows the latest-event retrieval pattern in isolation.)

  • There is only one create event per message, but n events of any other kind
  • For each group of events, only the latest should be taken into account when doing the calculations:
    • status_update – updates the status
    • rate_update – updates the rate
    • create – self-explanatory
  • Any event other than create may not carry the rest of the original row's information, or may carry inaccurate values (except for message_id and the field the event operates on). The dataset is simplified, but imagine there are many more columns, and more event types will be added later.
    • E.g. a rate_update may or may not have the status field set, or may hold a non-final value, so no calculation can be made on the status field of a rate_update event; the same goes for status_update and the rate field.
  • It can be assumed that the table is partitioned by date and that every query will make use of the partitions. Those conditions have been omitted for simplicity for now.
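To make that requirement concrete, here is a minimal sketch of the latest-event-per-message-and-type retrieval in BigQuery Standard SQL, using ROW_NUMBER() instead of a self-join. It assumes the sample `events` table above, and the date-partition filter is only hinted at in a comment:

    SELECT * EXCEPT(rn)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY message_id, event_type
                ORDER BY event_timestamp DESC) AS rn
        FROM `events`
        -- date-partition filter omitted for simplicity, as noted above
    )
    WHERE rn = 1

For the rate calculation, which needs the newest row across create and rate_update combined, the same pattern applies with those two event types bucketed into one partition.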

So I guess I have a couple of questions:

  • How can this query be optimized?
  • Would it be a better idea to put the events other than create in their own tables, where the only fields available would be the ones relevant to each event plus the ones needed for the joins (message_id, event_timestamp)? Would this reduce the amount of data processed?
  • What would be the best way to add more event types in the future, each with its own conditions and calculations?

Any advice on how to query this dataset efficiently and cost-effectively is more than welcome! Thank you! 🙂

The monstrosity I've come up with is the following. The INNER JOINs are used to retrieve the latest version of each row, as per this resource:

    select
        sent_at,
        sum(submitted_msg) as submitted,
        sum(delivered_msg) as delivered,
        sum(sales_rate_total) as sales_rate_total
    FROM (

      #DELIVERED
        SELECT 
            d.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            0 as submitted_msg,
            sum(if(status=1,1,0)) as delivered_msg,
            0 as sales_rate_total
        FROM `events` d
        INNER JOIN
                (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type = "status_update" 
                    group by 1
                    ) g on d.message_id = g.message_id and d.event_timestamp = g.ts
        GROUP BY 1,2

        UNION ALL

      #SALES RATE
        SELECT 
            s.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            0 as submitted_msg,
            0 as delivered_msg,
            sum(rate) as sales_rate_total
        FROM `events` s
        INNER JOIN 
                    (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type in ("rate_update", "create")  
                    group by 1
                    ) f on s.message_id = f.message_id and s.event_timestamp = f.ts
        GROUP BY 1,2

        UNION ALL

      #SUBMITTED & REST
        SELECT 
            r.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            sum(if(status=0,1,0)) as submitted_msg,
            0 as delivered_msg,
            0 as sales_rate_total
        FROM `events` r
        INNER JOIN
                (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type = "create" 
                    group by 1
                    ) e on r.message_id = e.message_id and r.event_timestamp = e.ts
        GROUP BY 1, 2

    ) k
    group by 1
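For comparison, here is a rough sketch of how the three JOIN+subquery branches might collapse into a single scan using window functions. This is an untested illustration against the simplified schema above (it assumes the rate lives in the `rate` column, as in the sample data), not a verified drop-in replacement:

    SELECT
        FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
        SUM(IF(event_type = 'create' AND rn_type = 1 AND status = 0, 1, 0)) AS submitted,
        SUM(IF(event_type = 'status_update' AND rn_type = 1 AND status = 1, 1, 0)) AS delivered,
        SUM(IF(event_type IN ('create', 'rate_update') AND rn_rate = 1, rate, 0)) AS sales_rate_total
    FROM (
        SELECT
            *,
            -- newest row per (message_id, event_type): feeds the create and status_update metrics
            ROW_NUMBER() OVER (
                PARTITION BY message_id, event_type
                ORDER BY event_timestamp DESC) AS rn_type,
            -- newest row per message across create + rate_update combined,
            -- mirroring the #SALES RATE branch above
            ROW_NUMBER() OVER (
                PARTITION BY message_id,
                    IF(event_type IN ('create', 'rate_update'), 'rate', event_type)
                ORDER BY event_timestamp DESC) AS rn_rate
        FROM `events`
    )
    GROUP BY 1

Each row is read once, there is no self-join, and a future event type would only add another conditional SUM (and, if needed, another ROW_NUMBER bucket).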

Best Answer

1. I've used SQL Server on Rextester to study your data, but I think the approach can be applied to google-bigquery.
2. I've never worked with google-bigquery.
3. English is not my first language.
4. Can I have an aspirin?

First, I think there is something wrong in your result table. Using your own subquery to obtain the sales rate:

Check it here: http://rextester.com/CHX54701

select e.*
from events e
    inner join (
                select message_id, max(event_timestamp) as event_timestamp 
                from events 
                where event_type in ('rate_update', 'create')  
                group by message_id        
               ) t 
               on t.message_id = e.message_id and t.event_timestamp = e.event_timestamp;

+----+-------------+---------------------+------------+---------------------+--------+------+
| id | event_type  | event_timestamp     | message_id | sent_at             | status | rate |
+----+-------------+---------------------+------------+---------------------+--------+------+
|  1 | create      | 25.11.2016 09:17:48 |          1 | 25.11.2016 09:17:48 |      0 |  0,5 |
|  9 | rate_update | 27.11.2016 09:55:07 |          2 | 25.11.2016 09:55:07 |      0 |    2 |
|  6 | create      | 26.11.2016 09:17:48 |          3 | 26.11.2016 09:17:48 |      0 |  0,5 |
|  8 | rate_update | 27.11.2016 09:55:07 |          4 | 27.11.2016 09:50:07 |      0 |    2 |
+----+-------------+---------------------+------------+---------------------+--------+------+

sum(rate) for 2016-11-25 should, by my reading, be 3.0 instead of 2.5: 0.5 from the create of message 1, plus 0.5 from the create of message 2, plus 2.0 from message 2's latest rate_update.

Is this correct? Because if not, everything below falls apart.

As I read it, in every subquery you group all records by message_id and take max(event_timestamp); then, depending on event_type, you take the sum() of different values.

So my first attempt was to get the max(event_timestamp) for every message_id and event_type:

select message_id, event_type, max(event_timestamp) event_timestamp
from events
group by message_id, event_type;

+------------+---------------+---------------------+
| message_id |   event_type  |   event_timestamp   |
+------------+---------------+---------------------+
|      1     |     create    | 25.11.2016 09:17:48 |
|      2     |     create    | 25.11.2016 09:47:48 |
|      3     |     create    | 26.11.2016 09:17:48 |
|      4     |     create    | 27.11.2016 09:17:48 |
|      2     |  rate_update  | 27.11.2016 09:55:07 |
|      4     |  rate_update  | 27.11.2016 09:55:07 |
|      1     | status_update | 25.11.2016 09:24:38 |
|      2     | status_update | 25.11.2016 09:54:38 |
+------------+---------------+---------------------+

Then, instead of a UNION of three queries plus subqueries, I think the whole thing can be accomplished with a single JOIN, like this:

with ct as
(
  -- newest event per (message_id, event_type), plus its latest sent_at formatted as yyyymmdd
  select message_id, event_type, max(event_timestamp) event_timestamp, convert(varchar(20), max(sent_at), 112) st
  from events
  group by message_id, event_type
)
select
       max(e.sent_at) sent_at,
       -- one CASE per metric, so a single scan covers all three sums
       sum(case when e.event_type = 'create' and status = 0 then 1 else 0 end) as submitted_msg,
       sum(case when e.event_type = 'status_update' and status = 1 then 1 else 0 end) as delivered_msg,
       sum(case when (e.event_type = 'create' or e.event_type = 'rate_update') and status = 0 then rate else 0 end) as sum_rate
from events e
     inner join ct on ct.message_id = e.message_id and ct.event_timestamp = e.event_timestamp
group by st
order by sent_at

I've researched google-bigquery and it does allow CTEs (WITH clauses), but you can also rewrite the above as a JOIN against a subquery.
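For illustration, here is the same query with the CTE inlined as a derived table; the logic is identical, and it is still SQL Server syntax, so in BigQuery the convert(varchar(20), ..., 112) call would need an equivalent such as FORMAT_TIMESTAMP:

select
       max(e.sent_at) sent_at,
       sum(case when e.event_type = 'create' and status = 0 then 1 else 0 end) as submitted_msg,
       sum(case when e.event_type = 'status_update' and status = 1 then 1 else 0 end) as delivered_msg,
       sum(case when (e.event_type = 'create' or e.event_type = 'rate_update') and status = 0 then rate else 0 end) as sum_rate
from events e
     inner join (
                 -- the former CTE, inlined: newest event per (message_id, event_type)
                 select message_id, event_type, max(event_timestamp) event_timestamp,
                        convert(varchar(20), max(sent_at), 112) st
                 from events
                 group by message_id, event_type
                ) ct on ct.message_id = e.message_id and ct.event_timestamp = e.event_timestamp
group by ct.st
order by sent_at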

In either form, as you can see, I've used three different CASE expressions to SUM the desired values.

The final result is:

+---------------------+---------------+---------------+----------+
| sent_at             | submitted_msg | delivered_msg | sum_rate |
+---------------------+---------------+---------------+----------+
| 25.11.2016 09:55:07 |             2 |             2 |        3 |
| 26.11.2016 09:17:48 |             1 |             0 |      0,5 |
| 27.11.2016 09:50:07 |             1 |             0 |      2,5 |
+---------------------+---------------+---------------+----------+

Check the final result here: http://rextester.com/FDIWA74637