SQL Server – Generate Unique ID Based on Multiple Rows Criteria

sql serversql-server-2008-r2t-sql

I've got the following scenario, I've got a table like this

| pid | itemId |
|-----|--------|
|123  | item1  |
|123  | item2  |
|234  | item1  |
|234  | item3  |
|456  | item1  |
|456  | item2  |
|567  | item1  |
|567  | item2  |
|567  | item3  |

I need to get a unique integer id based on the unique combination of items. So if two persons have chosen the same items, the unique combination should be the same number.

For example:

| pid | itemId | comb |
|-----|--------|------|
|123  | item1  |  1   |
|123  | item2  |      |
|234  | item1  |  2   |
|234  | item3  |      |
|456  | item1  |  1   |
|456  | item2  |      |
|567  | item1  |  3   |
|567  | item2  |      |
|567  | item3  |      |

The number of possible combinations at the moment are around 300. So when filtering it might be easier to filter by just one number in one column than filtering with multiple criteria.

Best Answer

One way to approach this problem is to concatenate all of the item values together for each PID and to assign a value to them using the DENSE_RANK window function. Unfortunately, SQL Server does not make it easy to do string concatenation at an aggregate level until STRING_AGG() in SQL Server vNext. Since you're on SQL Server 2008 I'm going to use the FOR XML path method of string aggregation. An explanation of that method along with other ways of doing it can be found in Grouped Concatenation in SQL Server.

Data prep using your sample data:

CREATE TABLE #X165719 (
PID INTEGER NOT NULL,
ITEMID VARCHAR(10) NOT NULL
);

INSERT INTO #X165719
VALUES 
(123, 'item1'),
(123, 'item2'),
(234, 'item1'),
(234, 'item3'),
(456, 'item1'),
(456, 'item2'),
(567, 'item1'),
(567, 'item2'),
(567, 'item3');

First let's do the string aggregation. One implementation is as follows:

SELECT 
  PID
, ITEMID
, (
    SELECT ',' + SUB.ITEMID
    FROM #X165719 SUB
    WHERE SUB.PID = t1.PID
    ORDER BY SUB.ITEMID
    FOR XML PATH (''), TYPE).value('.', 'varchar(max)') all_items
FROM #X165719 t1

You should pick a delimiter that doesn't appear in your source data if possible. I used a comma. Here is what the result set looks like:

╔═════╦════════╦════════════════════╗
║ PID ║ ITEMID ║     all_items      ║
╠═════╬════════╬════════════════════╣
║ 123 ║ item1  ║ ,item1,item2       ║
║ 123 ║ item2  ║ ,item1,item2       ║
║ 234 ║ item1  ║ ,item1,item3       ║
║ 234 ║ item3  ║ ,item1,item3       ║
║ 456 ║ item1  ║ ,item1,item2       ║
║ 456 ║ item2  ║ ,item1,item2       ║
║ 567 ║ item1  ║ ,item1,item2,item3 ║
║ 567 ║ item2  ║ ,item1,item2,item3 ║
║ 567 ║ item3  ║ ,item1,item2,item3 ║
╚═════╩════════╩════════════════════╝

Now we need to assign a different number for each unique value in the all_items column. One way to accomplish this is with the DENSE_RANK function. Quoting from BOL:

Returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.

The final query is:

SELECT 
  t2.PID
, t2.ITEMID
, DENSE_RANK() OVER (ORDER BY all_items) COMB
FROM
(
    SELECT 
      PID
    , ITEMID
    , (
        SELECT ',' + SUB.ITEMID
        FROM #X165719 SUB
        WHERE SUB.PID = t1.PID
        ORDER BY SUB.ITEMID
        FOR XML PATH (''), TYPE).value('.', 'varchar(max)') all_items
    FROM #X165719 t1
) t2;

The final result set is:

╔═════╦════════╦══════╗
║ PID ║ ITEMID ║ COMB ║
╠═════╬════════╬══════╣
║ 123 ║ item1  ║    1 ║
║ 123 ║ item2  ║    1 ║
║ 456 ║ item1  ║    1 ║
║ 456 ║ item2  ║    1 ║
║ 567 ║ item1  ║    2 ║
║ 567 ║ item2  ║    2 ║
║ 567 ║ item3  ║    2 ║
║ 234 ║ item1  ║    3 ║
║ 234 ║ item3  ║    3 ║
╚═════╩════════╩══════╝