Hive – Over Partition By vs Where Conditions

hivewindow functions

I have a table partitioned this way.

# Partition Information
# col_name              data_type 

year                    string
month                   string
day                     string
hour                    string

I want to run the query on a whole month. I understand that I could do

where day > 01 and day < 31

But the table is VERY big. I was told it would take too much time and that I should run it separately on every day. I thought about over partition by. Like

select col1, col2, sum(col3) over (partition by day)
from table

but I am not sure how this will work. Would the query work on every day successively. Would it be easier on the cluster? Would col1 and col2 be grouped automatically or would I need to add group by?

Table
col1 col2 col3 month date....
1    s    4    01    01
1    s    3    01    01
1    q    5    01    01

What I want in the result

col1 col2 col3 month date....
1    s    7    01    01
1    q    5    01    01

Best Answer

In order to make use of the partitions -

Your query on the whole month should look something like this:

select      col1,col2,sum(col3),year,month,day
from        mytale
where       year  = '2017'
        and month = '02'
group by    col1,col2,year,month,day

Your query on a single day should look something like this:

select      col1,col2,sum(col3),year,month,day
from        mytale
where       year  = '2017'
        and month = '02'
        and day   = '01'
group by    col1,col2,year,month,day

Your query on a range of days should look something like this:

select      col1,col2,sum(col3),year,month,day
from        mytale
where       year  = '2017'
        and month = '02'
        and day between '01' and '07'
group by    col1,col2,year,month,day