I have a table partitioned this way.
# Partition Information
# col_name data_type
year string
month string
day string
hour string
I want to run the query on a whole month. I understand that I could do
where day > 01 and day < 31
But the table is VERY big. I was told it would take too much time and that I should run it separately on every day. I thought about over partition by. Like
select col1, col2, sum(col3) over (partition by day)
from table
but I am not sure how this will work. Would the query work on every day successively. Would it be easier on the cluster? Would col1 and col2 be grouped automatically or would I need to add group by?
Table
col1 col2 col3 month date....
1 s 4 01 01
1 s 3 01 01
1 q 5 01 01
What I want in the result
col1 col2 col3 month date....
1 s 7 01 01
1 q 5 01 01
Best Answer
In order to make use of the partitions -
Your query on the whole month should look something like this:
Your query on a single day should look something like this:
Your query on a range of days should look something like this: