Calculating overlap between groups

oracleoracle-11g-r2pivot

I have a table with two columns of interest, item_id and bucket_id. There are a fixed number of values for bucket_id and I'm okay with listing them out if I need to.

Each item_id can appear multiple times, but each occurrence will have a separate bucket_id value. For example, the item_id of 123 can appear twice in the table, once under bucket_id of A, once under B.

My goal is to determine how much overlap exists between each pair of bucket_id values and display it as an N-by-N matrix.

For example, consider the following small example table:

item_id     bucket_id
=========   ===========
111         A
111         B
111         C

222         B
222         D

333         A
333         C

444         C

So for this dataset, buckets A and B have one item_id in common, buckets C and D have no items in common, etc.

I would like to get the above table formatted into something like the following:

        A       B       C       D
===================================
A       2       1       2       0
B       1       2       1       1
C       2       1       3       0
D       0       1       0       1

In the above table, the intersect of a row and column tells you how many records exist in both bucket_id values. For example, where the A row intersects the C column we have a 2, because there are 2 records that exist in both bucket_id A and C. Because the intersection of X and Y is the same as the intersection of Y and X, the above table is mirrored across the diagonal.

I imagine the query involves a PIVOT, but I can't for the life of me figure out how to get it working.

Note: the actual data I'm working with has a few hundred million entries but only a dozen or so buckets. I'd rather not join the table to itself if at all avoidable.

Best Answer

First, we need to get a list of all buckets. if you already have a table buckets with all bucket_id (as PK), this part is not needed:

with 
  buckets (bucket_id) as
  ( select distinct bucket_id
    from table_x
  ) ,

Then we get all bucket couples and the count of their common items - through a self-join:

  couples (bucket_a, bucket_b, cnt) as 
    ( select a.bucket_id, b.bucket_id, count(*)
      from table_x a 
        join table_x b
          on  a.bucket_id <= b.bucket_id
          and a.item_id = b.item_d 
      group by a.bucket_id, b.bucket_id
    )

Then we can show our results:

select a.bucket_id      as bucket_a, 
       b.bucket_id      as bucket_b, 
       coalesce(cnt, 0) as common_items
from buckets a
  join buckets b
    on a.bucket_id <= b.bucket_id
  left join couples c
    on  a.bucket_id = c.bucket_a
    and b.bucket_id = c.bucket_b ;

The above will show, for your sample table:

bucket_a  bucket_b  common_items
================================
A         A         2
A         B         1
A         C         2
A         D         0
B         B         2
B         C         1
B         D         1
C         C         3
C         D         0
D         D         1

If you want the pivoted output, you'll need one more processing.


If the number of buckets is very low as you say, you may try this one as well:

with 
  buckets (bucket_id) as
  ( select distinct bucket_id
    from table_x
  ) ,
select a.bucket_id      as bucket_a, 
       b.bucket_id      as bucket_b,
       ( select count(*)
         from table_x aa
           join table_x bb
             on  aa.item_id = bb.item_d 
         where a.bucket_id = aa.bucket_id
           and b.bucket_id = bb.bucket_id
       ) as common_items
from buckets a
  join buckets b
    on a.bucket_id <= b.bucket_id ;