How to group the null results in this FULL OUTER JOIN with non-null responses

coalescehivejoin;

I am looking at the overlap and non-overlap (unique values) of users-ids from two different select statements using a full join. The main differentiation being that one table will have a deal_id = 0 and the other will have any deal_id greater than or equal to one.
I am joining the select statements on exchange_id, pub_id, and user_id but not on deal_id.

The field incremental value is trying to calculate users who are present in the pool deal_id >= 1 and not present in the pool of deal_id = 0 (a main reason for the full join).

Here is a simplification of the query I've typed up, it's in two select statements :

SET
hive.auto.convert.join = TRUE
;

SELECT
    First.deal_id
    ,COALESCE( First.exchange_id, Second.exchange_id ) as exchange_id
    ,COALESCE( First.pub_id, Second.pub_id ) as pub_id
    ,COUNT (DISTINCT(case when Second.user_id is null then First.user_id else null END)) AS Incremental
    ,SUM (First.imps) AS First_imps
    ,SUM (Second.imps) AS Second_imps
    FROM
        (
            SELECT
                a.deal_id
                ,a.exchange_id
                ,a.pub_id
                ,a.user_id
                ,1 AS imps
            FROM
                logs a 
            WHERE
                a.deal_id >= 1
            AND a.event_type = 'TRUE'
        ) First 
        FULL JOIN (
            SELECT
                a.exchange_id
                ,a.pub_id
                ,a.user_id
                ,1 AS imps
            FROM
                logs a
            WHERE
            a.deal_id = 0
            AND a.event_type = 'TRUE'
        ) Second
        ON (
            First.exchange_id = Second.exchange_id
            AND First.pub_id = Second.pub_id
            AND First.user_id = Second.user_id
        )
        GROUP BY
        First.deal_id
        ,COALESCE( First.exchange_id, Second.exchange_id )
        ,COALESCE( First.pub_id, Second.pub_id )
;

Here are the results I am seeing:

DEAL_ID    EXCHANGE_ID    PUB_ID    INCREMENTAL    FIRST_IMPS    SECOND_IMPS
/N         4              1780      0              0             15
/N         4              1560      0              0             32
3389       4              1780      2              7             6
1534       4              1560      4              9             8

And here is what I would like to see:

DEAL_ID    EXCHANGE_ID    PUB_ID    INCREMENTAL    FIRST_IMPS    SECOND_IMPS
3389       4              1780      2              7             21
1534       4              1560      4              9             40

Where the results with a null deal id match up to the results with a non-null deal id based on exchange_id and pub_id.

What can I do?

Similar to this problem but this solution isn't working for this problem.

Note: I've posted this question on stackoverflow here but thought I might try dba instead

Edit: Here is a sqlfiddle that replicates the problem, note that it's using PostgreSQL while I'm using hql

Best Answer

All this approach does in to make your original query a derived table then group by pub_id.

SET hive.auto.convert.join = TRUE;

SELECT max(DEAL_ID) as DEAL_ID
     , EXCHANGE_ID
     , PUB_ID
     , sum(INCREMENTAL) as INCREMENTAL
     , sum(FIRST_IMPS) as FIRST_IMPS
     , sum(SECOND_IMPS) as SECOND_IMPS
  FROM (

    SELECT First.deal_id
         , COALESCE( First.exchange_id, Second.exchange_id ) as exchange_id
         , COALESCE( First.pub_id, Second.pub_id ) as pub_id
         , COUNT(DISTINCT(case when Second.user_id is null then First.user_id else null END)) AS Incremental
        , SUM(First.imps) AS First_imps
        , SUM(Second.imps) AS Second_imps
    FROM (SELECT a.deal_id
               , a.exchange_id
               , a.pub_id
               , a.user_id
               , 1 AS imps
            FROM logs a 
           WHERE a.deal_id >= 1
            AND a.event_type = 'TRUE'
         ) First 
FULL JOIN (SELECT a.exchange_id
                , a.pub_id
                , a.user_id
               , 1 AS imps
            FROM logs a
           WHERE a.deal_id = 0
             AND a.event_type = 'TRUE'
          ) Second
      ON (   First.exchange_id = Second.exchange_id
         AND First.pub_id = Second.pub_id
         AND First.user_id = Second.user_id
         )
   GROUP BY First.deal_id
          , COALESCE( First.exchange_id, Second.exchange_id )
          , COALESCE( First.pub_id, Second.pub_id )

  ) group by pub_id, exchange_id

;