I am looking at the overlap and non-overlap (unique values) of users-ids from two different select statements using a full join. The main differentiation being that one table will have a deal_id = 0 and the other will have any deal_id greater than or equal to one.
I am joining the select statements on exchange_id, pub_id, and user_id but not on deal_id.
The field incremental value is trying to calculate users who are present in the pool deal_id >= 1 and not present in the pool of deal_id = 0 (a main reason for the full join).
Here is a simplification of the query I've typed up, it's in two select statements :
SET hive.auto.convert.join = TRUE ; SELECT First.deal_id ,COALESCE( First.exchange_id, Second.exchange_id ) as exchange_id ,COALESCE( First.pub_id, Second.pub_id ) as pub_id ,COUNT (DISTINCT(case when Second.user_id is null then First.user_id else null END)) AS Incremental ,SUM (First.imps) AS First_imps ,SUM (Second.imps) AS Second_imps FROM ( SELECT a.deal_id ,a.exchange_id ,a.pub_id ,a.user_id ,1 AS imps FROM logs a WHERE a.deal_id >= 1 AND a.event_type = 'TRUE' ) First FULL JOIN ( SELECT a.exchange_id ,a.pub_id ,a.user_id ,1 AS imps FROM logs a WHERE a.deal_id = 0 AND a.event_type = 'TRUE' ) Second ON ( First.exchange_id = Second.exchange_id AND First.pub_id = Second.pub_id AND First.user_id = Second.user_id ) GROUP BY First.deal_id ,COALESCE( First.exchange_id, Second.exchange_id ) ,COALESCE( First.pub_id, Second.pub_id ) ;
Here are the results I am seeing:
DEAL_ID EXCHANGE_ID PUB_ID INCREMENTAL FIRST_IMPS SECOND_IMPS /N 4 1780 0 0 15 /N 4 1560 0 0 32 3389 4 1780 2 7 6 1534 4 1560 4 9 8
And here is what I would like to see:
DEAL_ID EXCHANGE_ID PUB_ID INCREMENTAL FIRST_IMPS SECOND_IMPS 3389 4 1780 2 7 21 1534 4 1560 4 9 40
Where the results with a null deal id match up to the results with a non-null deal id based on exchange_id and pub_id.
What can I do?
Similar to this problem but this solution isn't working for this problem.
Note: I've posted this question on stackoverflow here but thought I might try dba instead
Edit: Here is a sqlfiddle that replicates the problem, note that it's using PostgreSQL while I'm using hql