I have some daily data and monthly data.
I want to compare these two datasets, with Hive and perform a payout when the sum of actual
for a month passes that respective months target. In the above dataset, every id
has a monthly target of 6000. The payout should be 0 unless an id
accumulates more than 6000 for that respective month from the daily data.
Example: Comp1 has a target of 6000 for month 1. When we look at the daily data, we can see Comp1 hits this target on 2020-01-06
by having an accumulated actual
of 6600. At this point, the payout should be: (6600-6000)/100*1
. Additionally, Comp1 should see a payout for every day following until the end of the month as they've passed their monthly target.
This is what I have tried:
select
d.yyyy_mm_dd,
d.name,
d.id,
d.actual,
case when d.actual > t.target
then (d.actual - t.target) / 100.0
else 0
end payout
from monthly_targets t
inner join (
select yyyy_mm_dd, name, id, sum(cast(actual as int)) actual
from daily_data
group by yyyy_mm_dd, name, id
) d
on month(d.yyyy_mm_dd) = t.month
and d.name = t.name
where
d.yyyy_mm_dd >= '2020-01-01'
and d.name in ('Comp1', 'Comp2')
However, this is only comparing the daily actual
value to the monthly target. There will never be a payout this way as a monthly target will never be met on a single day. How can I amend the above so it calculates a cumulative actual
for the month and compares on that instead?
I'm expecting output like this:
+------------+----+-------+--------+----------------+--------+
| yyyy_mm_dd | id | name | actual | actual_to_date | payout |
+------------+----+-------+--------+----------------+--------+
| 2020-01-01 | 1 | Comp1 | 1100 | 1100 | 0 |
| 2020-01-02 | 1 | Comp1 | 1100 | 2200 | 0 |
| 2020-01-03 | 1 | Comp1 | 1100 | 3300 | 0 |
| 2020-01-04 | 1 | Comp1 | 1100 | 4400 | 0 |
| 2020-01-05 | 1 | Comp1 | 1100 | 5500 | 0 |
| 2020-01-06 | 1 | Comp1 | 1100 | 6600 | 6 |
| 2020-01-07 | 1 | Comp1 | 1100 | 7700 | 17 |
| 2020-01-08 | 1 | Comp1 | 1100 | 8800 | 28 |
| ... | | | | | |
| ... | | | | | |
| ... | | | | | |
| 2020-01-31 | 1 | Comp1 | 1100 | 34100 | 281 |
| 2020-02-01 | 1 | Comp1 | 1100 | 1100 | 0 |
| 2020-02-02 | 1 | Comp1 | 1100 | 2200 | 0 |
| 2020-02-03 | 1 | Comp1 | 1100 | 3300 | 0 |
+------------+----+-------+--------+----------------+--------+
Best Answer
I think I've now got a working solution. The below is giving the output as expected. It could probably be optimized a bit as it isn't the quickest (takes a couple minutes for only two companies). In my real dataset I have 65~ companies so if this could be improved, I'd love to see it!