Calculating overlap between groups

oracleoracle-11g-r2pivot

I have a table with two columns of interest, item_id and bucket_id. There are a fixed number of values for bucket_id and I'm okay with listing them out if I need to.

Each item_id can appear multiple times, but each occurrence will have a separate bucket_id value. For example, the item_id of 123 can appear twice in the table, once under bucket_id of A, once under B.

My goal is to determine how much overlap exists between each pair of bucket_id values and display it as an N-by-N matrix.

For example, consider the following small example table:

item_id     bucket_id
=========   ===========
111         A
111         B
111         C

222         B
222         D

333         A
333         C

444         C

So for this dataset, buckets A and B have one item_id in common, buckets C and D have no items in common, etc.

I would like to get the above table formatted into something like the following:

        A       B       C       D
===================================
A       2       1       2       0
B       1       2       1       1
C       2       1       3       0
D       0       1       0       1

In the above table, the intersect of a row and column tells you how many records exist in both bucket_id values. For example, where the A row intersects the C column we have a 2, because there are 2 records that exist in both bucket_id A and C. Because the intersection of X and Y is the same as the intersection of Y and X, the above table is mirrored across the diagonal.

I imagine the query involves a PIVOT, but I can't for the life of me figure out how to get it working.

Note: the actual data I'm working with has a few hundred million entries but only a dozen or so buckets. I'd rather not join the table to itself if at all avoidable.

Best Answer

First, we need to get a list of all buckets. if you already have a table buckets with all bucket_id (as PK), this part is not needed:

with 
  buckets (bucket_id) as
  ( select distinct bucket_id
    from table_x
  ) ,

Then we get all bucket couples and the count of their common items - through a self-join:

  couples (bucket_a, bucket_b, cnt) as 
    ( select a.bucket_id, b.bucket_id, count(*)
      from table_x a 
        join table_x b
          on  a.bucket_id <= b.bucket_id
          and a.item_id = b.item_d 
      group by a.bucket_id, b.bucket_id
    )

Then we can show our results:

select a.bucket_id      as bucket_a, 
       b.bucket_id      as bucket_b, 
       coalesce(cnt, 0) as common_items
from buckets a
  join buckets b
    on a.bucket_id <= b.bucket_id
  left join couples c
    on  a.bucket_id = c.bucket_a
    and b.bucket_id = c.bucket_b ;

The above will show, for your sample table:

bucket_a  bucket_b  common_items
================================
A         A         2
A         B         1
A         C         2
A         D         0
B         B         2
B         C         1
B         D         1
C         C         3
C         D         0
D         D         1

If you want the pivoted output, you'll need one more processing.

If the number of buckets is very low as you say, you may try this one as well:

with 
  buckets (bucket_id) as
  ( select distinct bucket_id
    from table_x
  ) ,
select a.bucket_id      as bucket_a, 
       b.bucket_id      as bucket_b,
       ( select count(*)
         from table_x aa
           join table_x bb
             on  aa.item_id = bb.item_d 
         where a.bucket_id = aa.bucket_id
           and b.bucket_id = bb.bucket_id
       ) as common_items
from buckets a
  join buckets b
    on a.bucket_id <= b.bucket_id ;

Related Solutions

Oracle – Best Indexes for Three Columns Join

You should benchmark it using autotrace (http://betteratoracle.com/posts/10-using-autotrace) for example.

If you expect millions of rows to be returned, then I do not think indexing will help at all. Simple full table scan and hash joining is the most efficient way of executing this kind of query (https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:9422487749968).

Putting the least selective column first and using key compression can decrease the index size and reduce IO and CPU consumption (http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:8806017524595).

The following index should be quite sufficient if you fetch only few rows:

create unique index test1 on example(extype,studentid,exid) compress 1;

Mysql – Pivot table with MySQL

Give this a try,tested on my machine

First make sure GROUP_CONCAT has plenty of space:

SET SESSION group_concat_max_len = 1000000;

SET @sql = NULL;
SELECT
GROUP_CONCAT(DISTINCT
    CONCAT(
      'COUNT(DISTINCT CASE WHEN buyer_country= ''',
      buyer_country,
      ''' THEN transactions.buyer_id END) AS ',
      buyer_country
    )
   )INTO @sql
FROM  transactions 
LEFT JOIN buyers on transactions.buyer_id=buyers.buyer_id;


SET @sql=CONCAT('SELECT transactions.seller_country,',@sql,' FROM  transactions 
LEFT JOIN buyers on transactions.buyer_id=buyers.buyer_id GROUP BY transactions.seller_country');



PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Link

Edit

ALTER TABLE transactions ADD key trans_idx (buyer_id)
ALTER TABLE buyers ADD key buyers_idx (buyer_id)

Best Answer

Related Solutions

Oracle – Best Indexes for Three Columns Join

Mysql – Pivot table with MySQL

Related Question