T-sql – Join table exactly once

join;t-sql

Rephrased, see original question below.

I have the following tables

  • Table A (UserId, Age)
  • Table B (UserId, Age)

For each row in table A I want exactly one row from table B that satisfies the following critera

  • Do not exist in Table A.
  • Has the same age as the record in table A.
  • Has not been assigned to any other row in table A (i.e. the column UserIdFromTableB in the result set should be unique)

Assumptions

  • There will be enough rows (20x) in table B to find a corresponding match by age.
  • If this condition fails, NULL should be returned.

Original question

I have a table A (UserId, Age). For each record in table A, I would like to select exactly one records from table B (UserId, Age).

B.UserId must be distinct from A.UserId, but A.Age must be equal to B.Age.

Each record from table B should be joined at most once – each record in table A should be assigned a (random) record from table B, but any given row in table B should only be assigned to a single row in table A (or should be left out of the results altogether).

Is there a set-based solution for this? It will of course be possible to do it in a procedure, but as the number of rows in table A can be quite large.

Edit: Table A and table B are different tables and do not have the same number of rows. If there are multiple rows in B each satisfying the join criteria (Age), a random row should be chosen.

Best Answer

Looks like we're not getting any more information, so I'll go ahead and post a partial answer that might or might not be helpful to you. For future questions, including information about data types, indexes, how many rows the tables have ("quite large" means different things to different people), and your expected output can be very helpful for those of us trying to answer your question.

This answer contains a "set-based solution" that assumes that you always have at least twice as many rows in Table B compared to Table A for every age. This is well within your 20X estimate, but I do not meet the requirement of returning NULL if there aren't enough rows. With that said, that case will be obvious because the query will return NULLs for the UserId from Table B.

For the data set, I assumed that Table A had 1 million rows spread evenly over 100 different ages and Table B had 20 million rows spread evenly over 100 different ages. There are no duplicate UserIds in either table for the same age, but rows can share an UserIds for different ages in the same table. Neither table has any indexes. Code to create this data set:

DROP TABLE IF EXISTS dbo.TableA;
CREATE TABLE dbo.TableA (UserId BIGINT, Age TINYINT);

INSERT INTO dbo.TableA WITH (TABLOCK)
SELECT 1 + (t.RN % 10000)
, (t.RN - 1) / 10000
FROM
(
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS dbo.TableB;
CREATE TABLE dbo.TableB (UserId BIGINT, Age TINYINT);

INSERT INTO dbo.TableB WITH (TABLOCK)
SELECT 1 + (t.RN % 200000)
, (t.RN - 1) / 200000
FROM
(
    SELECT TOP (20000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
    CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

The algorithm for matching can be expressed in the following way:

  • Process rows for a single age at a time. Suppose that TableA has m rows for that age and TableB has n rows for that age. As stated before, n >= 2 * m.
  • Assign random positive integers to the rows from TableA from 1 to m.
  • Assign random positive integers to the rows from TableB from 1 to n/2. Each number is repeated twice, so the set looks like 1, 1, 2, 2, ... n /2, n/2.
  • Join the two intermediate result sets on matching age, matching random numbers, and not matching UserIds.
  • Each UserId from TableA will match to either 1 or 2 UserIds from TableB. This is because the ids are unique within each table and age. The result set will have at most 2 * m rows.
  • Reduce the UserIds from TableB that match to 2 UserIds from TableB to a single row. Keep an arbitrary UserId from TableB.

My query to implement that is a bit of a mess:

SELECT
  t.Age
, t.UserId UserIdFromTableA
, CAST(RIGHT(magic_column, 20) AS BIGINT) UserIdFromTableB
FROM
(
    SELECT
      ta.Age
    , ta.UserId
    , MAX(
          RIGHT(REPLICATE('0', 10) + CAST(tb.RN AS VARCHAR(10)), 10) 
        + RIGHT(REPLICATE('0', 20) + CAST(tb.UserId AS VARCHAR(20)), 20)
    ) magic_column
    FROM
    (
        SELECT
          UserId
        , Age
        , ROW_NUMBER() OVER (PARTITION BY Age ORDER BY NEWID()) RN 
        FROM dbo.TableA
    ) ta
    LEFT OUTER JOIN 
    (
        SELECT
          UserId
        , Age
        , FLOOR(0.5 + 0.5 * ROW_NUMBER() OVER (PARTITION BY Age ORDER BY NEWID())) RN 
        FROM dbo.TableB
    ) tb ON ta.Age = tb.Age AND ta.RN = tb.RN AND ta.UserId <> tb.UserId
    GROUP BY ta.Age
    , ta.UserId
) t
OPTION (MAXDOP 1, HASH JOIN);

There are a few important tricks that I did in that query for performance reasons. The string casting in the aggregate is so that I can take an arbitrary row to resolve ties with a single GROUP BY. Normally I would just take the minimum or maximum UserId but you said that you wanted random results and that would slightly bias the UserIds chosen from TableB.

I mapped rows from TableB in the way I did so that I could have an extra equality condition in the join: ta.RN = tb.RN. Doing the mapping in a simpler way could lead to a join condition like this: tb.RN IN (2 * ta.RN, 2 * ta.RN - 1) which cannot be efficiently checked in a hash or merge join.

The query hint at the bottom is to help the query optimizer get a more efficient plan for my machine. All of the ROW_NUMBER() stuff makes it difficult for the query optimizer to come up with good cardinality estimates and to cost the plan correctly. For example, it has no idea that we've arranged the random sequences so that at most two rows match for every row in TableA. You may not need that query hint or a different query hint may be better for your server and data.

On my machine, the query returns 1000000 rows as expected in about 40 seconds. The query plan has a horrifically high estimated cost but that doesn't reflect reality. Here's what the plan looks like:

actual plan

Here's a sample of the query's results:

results

Again, I need to stress that this query will not return accurate results unless you have at least twice as many rows in Table B compared to Table A for every different age.