Rephrased, see original question below.
I have the following tables:
- Table A (UserId, Age)
- Table B (UserId, Age)

For each row in table A I want exactly one row from table B that satisfies the following criteria:
- Does not exist in Table A.
- Has the same age as the record in table A.
- Has not been assigned to any other row in table A (i.e. the column UserIdFromTableB in the result set should be unique).
Assumptions:
- There will be enough rows (20x) in table B to find a corresponding match by age.
- If this condition fails, NULL should be returned.
Original question
I have a table A (UserId, Age). For each record in table A, I would like to select exactly one record from table B (UserId, Age).
B.UserId must be distinct from A.UserId, but A.Age must be equal to B.Age.
Each record from table B should be joined at most once – each record in table A should be assigned a (random) record from table B, but any given row in table B should only be assigned to a single row in table A (or should be left out of the results altogether).
Is there a set-based solution for this? It would of course be possible to do it in a procedure, but the number of rows in table A can be quite large, so I would prefer to avoid that.
Edit: Table A and table B are different tables and do not have the same number of rows. If there are multiple rows in B each satisfying the join criteria (Age), a random row should be chosen.
Best Answer
Looks like we're not getting any more information, so I'll go ahead and post a partial answer that might or might not be helpful to you. For future questions, including information about data types, indexes, how many rows the tables have ("quite large" means different things to different people), and your expected output can be very helpful for those of us trying to answer your question.
This answer contains a "set-based solution" that assumes that you always have at least twice as many rows in Table B as in Table A for every age. This is well within your 20X estimate, but it does not meet the requirement of returning NULL if there aren't enough rows. With that said, that case will be obvious, because the query will return NULLs for the `UserId` from Table B.

For the data set, I assumed that Table A had 1 million rows spread evenly over 100 different ages and Table B had 20 million rows spread evenly over 100 different ages. There are no duplicate `UserId`s in either table for the same age, but rows can share a `UserId` for different ages in the same table. Neither table has any indexes. Code to create this data set:

The algorithm for matching can be expressed in the following way:
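(A sketch of a creation script for a data set with this shape; the exact table definitions, `UserId` ranges, and the use of `spt_values` as a row source are assumptions, not the original script.)

```sql
-- Sketch only: heap tables, 1M / 20M rows spread evenly over 100 ages,
-- no duplicate UserId within an age, and (by construction) no UserId
-- shared between TableA and TableB.
CREATE TABLE dbo.TableA (UserId BIGINT NOT NULL, Age BIGINT NOT NULL);
CREATE TABLE dbo.TableB (UserId BIGINT NOT NULL, Age BIGINT NOT NULL);

-- TableA: 10,000 users per age, ages 0-99
INSERT INTO dbo.TableA WITH (TABLOCK) (UserId, Age)
SELECT n % 10000 + 1, n / 10000
FROM (
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM master.dbo.spt_values t1
    CROSS JOIN master.dbo.spt_values t2
) q;

-- TableB: 200,000 users per age, ages 0-99, UserIds disjoint from TableA's
INSERT INTO dbo.TableB WITH (TABLOCK) (UserId, Age)
SELECT n % 200000 + 10001, n / 200000
FROM (
    SELECT TOP (20000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM master.dbo.spt_values t1
    CROSS JOIN master.dbo.spt_values t2
    CROSS JOIN master.dbo.spt_values t3
) q;
```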
1. Consider a single age. `TableA` has m rows for that age and `TableB` has n rows for that age. As stated before, n >= 2 * m.
2. Map each row in `TableA` for that age to a distinct number from 1 to m, in random order.
3. Map each row in `TableB` for that age to a number from 1 to n/2, in random order. Each number is repeated twice, so the set looks like 1, 1, 2, 2, ..., n/2, n/2.
4. Join the tables on age and the mapped numbers. Each `UserId` from `TableA` will match to either 1 or 2 `UserId`s from `TableB`. This is because the ids are unique within each table and age. The result set will have at most 2 * m rows.
5. Reduce the `UserId`s from `TableA` that match to 2 `UserId`s from `TableB` to a single row each. Keep an arbitrary `UserId` from `TableB`.

My query to implement that is a bit of a mess:
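A stripped-down sketch of how the steps above translate to T-SQL (not the original query; it uses a plain `MIN` tie-break, which slightly biases the result, and omits any query hint):

```sql
-- Sketch of the numbered steps above. Table/column names assumed.
WITH ta AS (
    SELECT UserId, Age,
           -- step 2: distinct numbers 1..m per age, random order
           ROW_NUMBER() OVER (PARTITION BY Age
                              ORDER BY CHECKSUM(NEWID())) AS RN
    FROM dbo.TableA
), tb AS (
    SELECT UserId, Age,
           -- step 3: numbers 1, 1, 2, 2, ..., n/2, n/2 per age, random order
           (ROW_NUMBER() OVER (PARTITION BY Age
                               ORDER BY CHECKSUM(NEWID())) + 1) / 2 AS RN
    FROM dbo.TableB
)
SELECT ta.UserId AS UserIdFromTableA,
       MIN(tb.UserId) AS UserIdFromTableB  -- step 5: collapse 1-2 matches to one
FROM ta
LEFT JOIN tb                               -- LEFT JOIN returns NULL when unmatched
    ON tb.Age = ta.Age
   AND tb.RN = ta.RN                       -- step 4: equality join on age + number
GROUP BY ta.UserId, ta.Age;
```

The sketch assumes the two `UserId` populations are disjoint, as in the test data; with real data you would also need to exclude `TableB` rows whose ids appear in `TableA`.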
There are a few important tricks that I did in that query for performance reasons. The string casting in the aggregate is so that I can take an arbitrary row to resolve ties with a single `GROUP BY`. Normally I would just take the minimum or maximum `UserId`, but you said that you wanted random results, and that would slightly bias the `UserId`s chosen from `TableB`.

I mapped rows from `TableB` in the way I did so that I could have an extra equality condition in the join: `ta.RN = tb.RN`. Doing the mapping in a simpler way could lead to a join condition like `tb.RN IN (2 * ta.RN, 2 * ta.RN - 1)`, which cannot be efficiently checked in a hash or merge join.

The query hint at the bottom is there to help the query optimizer get a more efficient plan for my machine. All of the `ROW_NUMBER()` stuff makes it difficult for the query optimizer to come up with good cardinality estimates and to cost the plan correctly. For example, it has no idea that we've arranged the random sequences so that at most two rows match for every row in `TableA`. You may not need that query hint, or a different query hint may be better for your server and data.

On my machine, the query returns 1000000 rows as expected in about 40 seconds. The query plan has a horrifically high estimated cost, but that doesn't reflect reality. Here's what the plan looks like:
Here's a sample of the query's results:
Again, I need to stress that this query will not return accurate results unless you have at least twice as many rows in Table B compared to Table A for every different age.