I have a table with two columns of interest, item_id
and bucket_id
. There are a fixed number of values for bucket_id
and I'm okay with listing them out if I need to.
Each item_id
can appear multiple times, but each occurrence will have a separate bucket_id
value. For example, the item_id
of 123
can appear twice in the table, once under bucket_id
of A
, once under B
.
My goal is to determine how much overlap exists between each pair of bucket_id
values and display it as an N-by-N matrix.
For example, consider the following small example table:
item_id bucket_id
========= ===========
111 A
111 B
111 C
222 B
222 D
333 A
333 C
444 C
So for this dataset, buckets A
and B
have one item_id
in common, buckets C
and D
have no items in common, etc.
I would like to get the above table formatted into something like the following:
A B C D
===================================
A 2 1 2 0
B 1 2 1 1
C 2 1 3 0
D 0 1 0 1
In the above table, the intersect of a row and column tells you how many records exist in both bucket_id
values. For example, where the A
row intersects the C
column we have a 2
, because there are 2 records that exist in both bucket_id
A and C. Because the intersection of X and Y is the same as the intersection of Y and X, the above table is mirrored across the diagonal.
I imagine the query involves a PIVOT
, but I can't for the life of me figure out how to get it working.
Note: the actual data I'm working with has a few hundred million entries but only a dozen or so buckets. I'd rather not join the table to itself if at all avoidable.
Best Answer
First, we need to get a list of all buckets. if you already have a table
buckets
with allbucket_id
(as PK), this part is not needed:Then we get all bucket couples and the count of their common items - through a self-join:
Then we can show our results:
The above will show, for your sample table:
If you want the pivoted output, you'll need one more processing.
If the number of buckets is very low as you say, you may try this one as well: