Correlate subquery in hive

hivesubquery

SELECT ndo.sku ParentSKU, visitsWhenSKUWasOnline.s
FROM 
Temp.NumberOfDaysOnline ndo 
JOIN
(
    SELECT SUM(gasessiondata.sessions) as s
    FROM gasessiondata
    WHERE
    gasessiondata.date >= ndo.FromDate
    AND
    gasessiondata.date <= ndo.ToDate
  ) as visitsWhenSKUWasOnline
  ON 1=1

I can't use ndo.fromDate and ndo.ToDate inside of the subquery since hive does not support correlated sub queries. What could be an alternative for something like this?

Edit:

down vote
favorite
I have a table like this:

date          val

2016-1-1       8
2016-2-1       10
2016-1-2       30
2016-1-3       30

Now, I have two dates coming in from a different table, let's call them `fromdate` and todate

If fromdate was 1, 2016-1-1 and Todate was 2016-1-2, I need 8+30 = 38 as the final value

I need to check

SUM(val) of all those values dates corresponding to which are between the fromdate and twodate.

Best Answer

The syntax of you query is not often called "correlated subquery". A correlated subquery would be like this:

SELECT 
    ndo.sku AS ParentSKU, 
    (
        SELECT SUM(g.sessions)
        FROM gasessiondata AS g
        WHERE g.date >= ndo.FromDate
          AND g.date <= ndo.ToDate
    ) AS s
FROM 
    Temp.NumberOfDaysOnline AS ndo ;

Your query is similar to a LATERAL join (CROSS / OUTER APPLY in SQL Server). This type of correlated joins has not been implemented in many DBMS.

It's easy to rewrite though so the join is done first and then the GROUP BY. This simple query should be supported by most, if not all DBMS:

SELECT 
    ndo.sku AS ParentSKU, 
    SUM(g.sessions) AS s
FROM 
        Temp.NumberOfDaysOnline AS ndo 
    JOIN
        gasessiondata AS g
    ON  g.date >= ndo.FromDate
    AND g.date <= ndo.ToDate
GROUP BY 
    ndo.PK,             -- the PRIMARY KEY of the table
    ndo.sku ;           -- and any column used in the SELECT list

Seems that Hive supports only equi-joins and cross joins. Moving the condition to the WHERE clause might solve this:

SELECT 
    ndo.sku AS ParentSKU, 
    SUM(g.sessions) AS s
FROM 
        Temp.NumberOfDaysOnline AS ndo 
    CROSS JOIN
        gasessiondata AS g
WHERE 
        g.date >= ndo.FromDate
    AND g.date <= ndo.ToDate
GROUP BY 
    ndo.PK,             -- the PRIMARY KEY of the table
    ndo.sku ;           -- and any column used in the SELECT list