Hive – Over Partition By vs Where Conditions

Tags: hive, window-functions

I have a table partitioned this way.

# Partition Information
# col_name              data_type 

year                    string
month                   string
day                     string
hour                    string

I want to run the query on a whole month. I understand that I could do

where day >= '01' and day <= '31'

But the table is VERY big. I was told it would take too much time and that I should run the query separately for each day. I thought about using over (partition by ...), like this:

select col1, col2, sum(col3) over (partition by day)
from table

but I am not sure how this will work. Would the query run on each day successively? Would it be easier on the cluster? Would col1 and col2 be grouped automatically, or would I need to add group by?

Table
col1 col2 col3 month date....
1    s    4    01    01
1    s    3    01    01
1    q    5    01    01

What I want in the result

col1 col2 col3 month date....
1    s    7    01    01
1    q    5    01    01

Best Answer

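To make use of the partitions, filter on the partition columns (year, month, day) in the where clause, so that only the matching partitions are read, and use group by to get one row per col1, col2 and day.

Your query on the whole month should look something like this: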

select      col1,col2,sum(col3),year,month,day
from        mytable
where       year  = '2017'
        and month = '02'
group by    col1,col2,year,month,day

Your query on a single day should look something like this:

select      col1,col2,sum(col3),year,month,day
from        mytable
where       year  = '2017'
        and month = '02'
        and day   = '01'
group by    col1,col2,year,month,day

Your query on a range of days should look something like this:

select      col1,col2,sum(col3),year,month,day
from        mytable
where       year  = '2017'
        and month = '02'
        and day between '01' and '07'
group by    col1,col2,year,month,day
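
As for the over (partition by day) idea from the question: a window function does not collapse rows, and its partition by is unrelated to the table's storage partitions. The sketch below illustrates the difference on the sample rows from the question. It is only an illustration under assumptions: an in-memory SQLite database (3.25 or later, for window functions) stands in for Hive, and the table name t is made up.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (an assumption for illustration);
# the table name `t` is hypothetical, the rows are the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (col1 int, col2 text, col3 int, month text, day text)")
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [(1, "s", 4, "01", "01"), (1, "s", 3, "01", "01"), (1, "q", 5, "01", "01")],
)

# group by collapses the rows: one output row per (col1, col2, month, day).
grouped = conn.execute(
    "select col1, col2, sum(col3), month, day from t "
    "where month = '01' "
    "group by col1, col2, month, day "
    "order by col2 desc"
).fetchall()

# A window function keeps every input row and repeats the per-day total on each one.
windowed = conn.execute(
    "select col1, col2, sum(col3) over (partition by day) from t "
    "order by col2 desc"
).fetchall()

print(grouped)   # two rows, matching the result wanted in the question
print(windowed)  # still three rows, each carrying the total for the whole day
```

So the window version returns as many rows as the input, which is not the result asked for; group by is the one that produces it.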

Related Solutions

Correlate subquery in hive

The syntax of your query is not what is usually called a "correlated subquery". A correlated subquery would look like this:

SELECT 
    ndo.sku AS ParentSKU, 
    (
        SELECT SUM(g.sessions)
        FROM gasessiondata AS g
        WHERE g.date >= ndo.FromDate
          AND g.date <= ndo.ToDate
    ) AS s
FROM 
    Temp.NumberOfDaysOnline AS ndo ;

Your query is similar to a LATERAL join (CROSS / OUTER APPLY in SQL Server). This type of correlated join is not implemented in many DBMSs.

It is easy to rewrite, though, so that the join is done first and the GROUP BY afterwards. This simple query should be supported by most, if not all, DBMSs:

SELECT 
    ndo.sku AS ParentSKU, 
    SUM(g.sessions) AS s
FROM 
        Temp.NumberOfDaysOnline AS ndo 
    JOIN
        gasessiondata AS g
    ON  g.date >= ndo.FromDate
    AND g.date <= ndo.ToDate
GROUP BY 
    ndo.PK,             -- the PRIMARY KEY of the table
    ndo.sku ;           -- and any column used in the SELECT list

It seems that Hive supports only equi-joins and cross joins. Moving the condition to the WHERE clause might solve this:

SELECT 
    ndo.sku AS ParentSKU, 
    SUM(g.sessions) AS s
FROM 
        Temp.NumberOfDaysOnline AS ndo 
    CROSS JOIN
        gasessiondata AS g
WHERE 
        g.date >= ndo.FromDate
    AND g.date <= ndo.ToDate
GROUP BY 
    ndo.PK,             -- the PRIMARY KEY of the table
    ndo.sku ;           -- and any column used in the SELECT list
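
One thing to watch with the join rewrites: they are inner joins, so a row of NumberOfDaysOnline with no sessions in its date range disappears from the result, whereas the correlated subquery returns it with a NULL sum. A small sketch of that difference follows, with SQLite standing in for Hive; the simplified table name ndo, the lowercase column names and the rows are all made up for illustration.

```python
import sqlite3

# SQLite stands in for Hive; table/column names are simplified and the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table ndo (pk integer primary key, sku text, from_date text, to_date text);
    create table gasessiondata (date text, sessions int);
    insert into ndo values (1, 'A', '2017-01-01', '2017-01-03'),
                           (2, 'B', '2017-02-01', '2017-02-02');
    insert into gasessiondata values ('2017-01-01', 10),
                                     ('2017-01-02', 5),
                                     ('2017-01-09', 7);
""")

# Correlated subquery: every ndo row is kept, even when nothing matches.
correlated = conn.execute("""
    select ndo.sku,
           (select sum(g.sessions)
            from gasessiondata as g
            where g.date >= ndo.from_date and g.date <= ndo.to_date) as s
    from ndo
    order by ndo.sku
""").fetchall()

# Cross join + where + group by: ndo rows without matching sessions drop out.
cross_joined = conn.execute("""
    select ndo.sku, sum(g.sessions) as s
    from ndo cross join gasessiondata as g
    where g.date >= ndo.from_date and g.date <= ndo.to_date
    group by ndo.pk, ndo.sku
    order by ndo.sku
""").fetchall()

print(correlated)    # sku B appears with a NULL (None) sum
print(cross_joined)  # sku B is missing
```

If the rows without sessions matter, the join version would need extra handling (for example an outer join where the DBMS allows it).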

How to Select Last Non-NULL Values per Column in SQL Server

Using the concatenation technique from "The Last non NULL Puzzle" by Itzik Ben-Gan, the query would look like this with your sample table and column data types:

select T.cat,
       cast(substring(
                     max(cast(T.t as binary(2)) + cast(T.val1 as binary(4))),
                     3,
                     4
                     ) as int) as val1,
       cast(substring(
                     max(cast(T.t as binary(2)) + cast(T.val2 as binary(1))),
                     3,
                     1
                     ) as char(1)) as val2
from #tab as T
group by T.cat;

Another way to write this query divides the steps into CTEs, which perhaps shows better what is going on. It gives the exact same execution plan as the query above.

with C1 as
(
  -- Concatenate the ordering column with the value column
  select T.cat,
        cast(T.t as binary(2)) + cast(T.val1 as binary(4)) as val1,
        cast(T.t as binary(2)) + cast(T.val2 as binary(1)) as val2
  from #tab as T
),
C2 as
(
  -- Get the max concatenated value per group
  select C1.cat,
         max(C1.val1) as val1,
         max(C1.val2) as val2
  from C1
  group by C1.cat
)
-- Extract the value from the concatenated column
select C2.cat,
       cast(substring(C2.val1, 3, 4) as int) as val1,
       cast(substring(C2.val2, 3, 1) as char(1)) as val2
from C2;

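This solution relies on two facts: concatenating a null value with something results in a null value (see SET CONCAT_NULL_YIELDS_NULL (Transact-SQL)), and max ignores null values, so only rows with a non-null value take part in the comparison.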
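
To see why the concatenation technique works, here is a rough sketch of the same idea outside SQL Server, in plain Python. It is not the T-SQL above, just the mechanism: prefix each non-null value with the ordering column as fixed-width big-endian bytes, take the maximum per group, then cut the prefix off. The sample rows and the helper name last_non_null are made up for illustration, and t is assumed to be non-negative.

```python
import struct

# Hypothetical sample rows: (cat, t, val1, val2); None plays the role of NULL.
rows = [
    ("a", 1, 10, "x"),
    ("a", 2, None, "y"),
    ("a", 3, 30, None),
    ("b", 1, 7, None),
    ("b", 2, None, None),
]

def last_non_null(rows, value_index, pack_value, unpack_value):
    """Per cat, return the non-null value that belongs to the largest t."""
    best = {}
    for row in rows:
        cat, t, value = row[0], row[1], row[value_index]
        if value is None:
            best.setdefault(cat, None)  # the group still exists, but this row is skipped
            continue
        # 2-byte big-endian prefix so byte order matches the numeric order of t
        key = struct.pack(">H", t) + pack_value(value)
        if best.get(cat) is None or key > best[cat]:
            best[cat] = key
    # Strip the 2-byte prefix to recover the value itself
    return {cat: None if key is None else unpack_value(key[2:]) for cat, key in best.items()}

val1 = last_non_null(rows, 2, lambda v: struct.pack(">i", v),
                     lambda b: struct.unpack(">i", b)[0])
val2 = last_non_null(rows, 3, lambda v: v.encode("ascii"),
                     lambda b: b.decode("ascii"))

print(val1)
print(val2)
```

Because the ordering prefix comes first, the byte-wise maximum is decided by t, and the value bytes only ride along with it.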