Mysql – Compare running / cumulative total with static target

aggregatehiveMySQLwindow functions

I have some daily data and monthly data.

I want to compare these two datasets, with Hive and perform a payout when the sum of actual for a month passes that respective months target. In the above dataset, every id has a monthly target of 6000. The payout should be 0 unless an id accumulates more than 6000 for that respective month from the daily data.

Example: Comp1 has a target of 6000 for month 1. When we look at the daily data, we can see Comp1 hits this target on 2020-01-06 by having an accumulated actual of 6600. At this point, the payout should be: (6600-6000)/100*1. Additionally, Comp1 should see a payout for every day following until the end of the month as they've passed their monthly target.

This is what I have tried:

select
    d.yyyy_mm_dd,
    d.name,
    d.id,
    d.actual,
    case when d.actual > t.target
        then (d.actual - t.target) / 100.0
        else 0
    end payout
from monthly_targets t
inner join (
    select yyyy_mm_dd, name, id, sum(cast(actual as int)) actual
    from daily_data
    group by yyyy_mm_dd, name, id
) d
    on month(d.yyyy_mm_dd) = t.month
    and d.name = t.name
where
    d.yyyy_mm_dd >= '2020-01-01' 
    and d.name in ('Comp1', 'Comp2')

However, this is only comparing the daily actual value to the monthly target. There will never be a payout this way as a monthly target will never be met on a single day. How can I amend the above so it calculates a cumulative actual for the month and compares on that instead?

I'm expecting output like this:

+------------+----+-------+--------+----------------+--------+
| yyyy_mm_dd | id | name  | actual | actual_to_date | payout |
+------------+----+-------+--------+----------------+--------+
| 2020-01-01 |  1 | Comp1 |   1100 |           1100 |      0 |
| 2020-01-02 |  1 | Comp1 |   1100 |           2200 |      0 |
| 2020-01-03 |  1 | Comp1 |   1100 |           3300 |      0 |
| 2020-01-04 |  1 | Comp1 |   1100 |           4400 |      0 |
| 2020-01-05 |  1 | Comp1 |   1100 |           5500 |      0 |
| 2020-01-06 |  1 | Comp1 |   1100 |           6600 |      6 |
| 2020-01-07 |  1 | Comp1 |   1100 |           7700 |     17 |
| 2020-01-08 |  1 | Comp1 |   1100 |           8800 |     28 |
| ...        |    |       |        |                |        |
| ...        |    |       |        |                |        |
| ...        |    |       |        |                |        |
| 2020-01-31 |  1 | Comp1 |   1100 |          34100 |    281 |
| 2020-02-01 |  1 | Comp1 |   1100 |           1100 |      0 |
| 2020-02-02 |  1 | Comp1 |   1100 |           2200 |      0 |
| 2020-02-03 |  1 | Comp1 |   1100 |           3300 |      0 |
+------------+----+-------+--------+----------------+--------+

Best Answer

I think I've now got a working solution. The below is giving the output as expected. It could probably be optimized a bit as it isn't the quickest (takes a couple minutes for only two companies). In my real dataset I have 65~ companies so if this could be improved, I'd love to see it!

SELECT
    x.yyyy_mm_dd,
    x.id,
    x.name,
    x.actual,
    x.target,
    x.actual_to_date,
    CASE WHEN x.actual_to_date > x.target THEN ((x.actual_to_date - x.target) /100) * 1 ELSE 0 END AS payout
FROM(
    SELECT
        daily.yyyy_mm_dd,
        daily.id,
        daily.name,
        daily.actual,
        t.target,
        SUM(daily.actual) OVER (PARTITION BY MONTH(daily.yyyy_mm_dd), daily.id ORDER BY daily.yyyy_mm_dd RANGE UNBOUNDED PRECEDING) AS actual_to_date
    FROM(
        SELECT
            yyyy_mm_dd,
            id,
            name,
            sum(cast(actual as int)) as actual
        FROM
            daily_data_table
        WHERE
            yyyy_mm_dd >= '2020-01-01'
        GROUP BY
            1,2,3
    ) daily
    INNER JOIN
        monthly_target_table t
        ON t.id = daily.id AND t.month = month(daily.yyyy_mm_dd)
    WHERE
        daily.name = 'Comp1'
) x