T-sql – Join table exactly once

join;t-sql

Rephrased, see original question below.

I have the following tables

Table A (UserId, Age)
Table B (UserId, Age)

For each row in table A I want exactly one row from table B that satisfies the following critera

Do not exist in Table A.
Has the same age as the record in table A.
Has not been assigned to any other row in table A (i.e. the column UserIdFromTableB in the result set should be unique)

Assumptions

There will be enough rows (20x) in table B to find a corresponding match by age.
If this condition fails, NULL should be returned.

Original question

I have a table A (UserId, Age). For each record in table A, I would like to select exactly one records from table B (UserId, Age).

B.UserId must be distinct from A.UserId, but A.Age must be equal to B.Age.

Each record from table B should be joined at most once – each record in table A should be assigned a (random) record from table B, but any given row in table B should only be assigned to a single row in table A (or should be left out of the results altogether).

Is there a set-based solution for this? It will of course be possible to do it in a procedure, but as the number of rows in table A can be quite large.

Edit: Table A and table B are different tables and do not have the same number of rows. If there are multiple rows in B each satisfying the join criteria (Age), a random row should be chosen.

Best Answer

Looks like we're not getting any more information, so I'll go ahead and post a partial answer that might or might not be helpful to you. For future questions, including information about data types, indexes, how many rows the tables have ("quite large" means different things to different people), and your expected output can be very helpful for those of us trying to answer your question.

This answer contains a "set-based solution" that assumes that you always have at least twice as many rows in Table B compared to Table A for every age. This is well within your 20X estimate, but I do not meet the requirement of returning NULL if there aren't enough rows. With that said, that case will be obvious because the query will return NULLs for the UserId from Table B.

For the data set, I assumed that Table A had 1 million rows spread evenly over 100 different ages and Table B had 20 million rows spread evenly over 100 different ages. There are no duplicate UserIds in either table for the same age, but rows can share an UserIds for different ages in the same table. Neither table has any indexes. Code to create this data set:

DROP TABLE IF EXISTS dbo.TableA;
CREATE TABLE dbo.TableA (UserId BIGINT, Age TINYINT);

INSERT INTO dbo.TableA WITH (TABLOCK)
SELECT 1 + (t.RN % 10000)
, (t.RN - 1) / 10000
FROM
(
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS dbo.TableB;
CREATE TABLE dbo.TableB (UserId BIGINT, Age TINYINT);

INSERT INTO dbo.TableB WITH (TABLOCK)
SELECT 1 + (t.RN % 200000)
, (t.RN - 1) / 200000
FROM
(
    SELECT TOP (20000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
    CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

The algorithm for matching can be expressed in the following way:

Process rows for a single age at a time. Suppose that TableA has m rows for that age and TableB has n rows for that age. As stated before, n >= 2 * m.
Assign random positive integers to the rows from TableA from 1 to m.
Assign random positive integers to the rows from TableB from 1 to n/2. Each number is repeated twice, so the set looks like 1, 1, 2, 2, ... n /2, n/2.
Join the two intermediate result sets on matching age, matching random numbers, and not matching UserIds.
Each UserId from TableA will match to either 1 or 2 UserIds from TableB. This is because the ids are unique within each table and age. The result set will have at most 2 * m rows.
Reduce the UserIds from TableB that match to 2 UserIds from TableB to a single row. Keep an arbitrary UserId from TableB.

My query to implement that is a bit of a mess:

SELECT
  t.Age
, t.UserId UserIdFromTableA
, CAST(RIGHT(magic_column, 20) AS BIGINT) UserIdFromTableB
FROM
(
    SELECT
      ta.Age
    , ta.UserId
    , MAX(
          RIGHT(REPLICATE('0', 10) + CAST(tb.RN AS VARCHAR(10)), 10) 
        + RIGHT(REPLICATE('0', 20) + CAST(tb.UserId AS VARCHAR(20)), 20)
    ) magic_column
    FROM
    (
        SELECT
          UserId
        , Age
        , ROW_NUMBER() OVER (PARTITION BY Age ORDER BY NEWID()) RN 
        FROM dbo.TableA
    ) ta
    LEFT OUTER JOIN 
    (
        SELECT
          UserId
        , Age
        , FLOOR(0.5 + 0.5 * ROW_NUMBER() OVER (PARTITION BY Age ORDER BY NEWID())) RN 
        FROM dbo.TableB
    ) tb ON ta.Age = tb.Age AND ta.RN = tb.RN AND ta.UserId <> tb.UserId
    GROUP BY ta.Age
    , ta.UserId
) t
OPTION (MAXDOP 1, HASH JOIN);

There are a few important tricks that I did in that query for performance reasons. The string casting in the aggregate is so that I can take an arbitrary row to resolve ties with a single GROUP BY. Normally I would just take the minimum or maximum UserId but you said that you wanted random results and that would slightly bias the UserIds chosen from TableB.

I mapped rows from TableB in the way I did so that I could have an extra equality condition in the join: ta.RN = tb.RN. Doing the mapping in a simpler way could lead to a join condition like this: tb.RN IN (2 * ta.RN, 2 * ta.RN - 1) which cannot be efficiently checked in a hash or merge join.

The query hint at the bottom is to help the query optimizer get a more efficient plan for my machine. All of the ROW_NUMBER() stuff makes it difficult for the query optimizer to come up with good cardinality estimates and to cost the plan correctly. For example, it has no idea that we've arranged the random sequences so that at most two rows match for every row in TableA. You may not need that query hint or a different query hint may be better for your server and data.

On my machine, the query returns 1000000 rows as expected in about 40 seconds. The query plan has a horrifically high estimated cost but that doesn't reflect reality. Here's what the plan looks like:

Here's a sample of the query's results:

Again, I need to stress that this query will not return accurate results unless you have at least twice as many rows in Table B compared to Table A for every different age.

Related Solutions

Mysql – How to write this self join based on three columns

The problem with your provided query is that you're doing an INNER JOIN. Because there are no rows with a parentid of 5, the subquery would return NULL on those rows and be excluded.

What you want is a LEFT JOIN, something like this (sqlfiddle):

SELECT t.*, parentchildrelation.parentid, childs
FROM tbl_info t
LEFT JOIN 
  (SELECT 
    parentid,
    COUNT(DISTINCT id) AS childs 
  FROM
    tbl_info  
  GROUP BY parentid) AS parentchildrelation 
  ON t.id = parentchildrelation.parentid 
WHERE t.no_daughter != parentchildrelation.childs 
   OR (t.no_daughter > 0 AND parentchildrelation.parentid IS NULL)
GROUP BY t.id, t.parentid;

I included more columns, because where id=5 and parentid=2 has two different values for no_daughter. You should be able to adjust those as needed.

Sql-server – the best way to get all data for a date range, plus the last event just before the range

I am going to assume that there isn't an index on the date columns, otherwise I think that the query would have been structured differently. If there is, you can probably find a better performing one than this.

The advantage of this query is that it can get all the data in one scan. The disadvantage is that it has to sort the data and join EventEmployee on the entire table. So as always, test with your own situation. This query also assumes that the MAX date is either unique or that equivalent rows would be acceptable.

USE AdventureWorks2012
GO
;
WITH Base AS (
   SELECT 
      TransactionHistory.*
      ,ProductVendor.BusinessEntityID
      ,MAX(CASE WHEN TransactionDate < '2008-08-01' THEN TransactionDate END) 
           OVER (PARTITION BY ProductVendor.BusinessEntityID) AS PreviousVendorTransaction
      ,COUNT(CASE WHEN TransactionDate >= '2008-08-01' THEN 1 END ) 
           OVER (PARTITION BY ProductVendor.BusinessEntityID) AS VendorAfterCutoff
   FROM
      Production.TransactionHistory
      -- Doesn't make the most sense, but I need a repeating relation
      INNER JOIN Purchasing.ProductVendor
         ON TransactionHistory.ProductID = ProductVendor.ProductID
),
Filtered AS (
   SELECT
      *
   FROM
      Base
   WHERE
      Base.TransactionDate >= '2008-08-01'
      OR (TransactionDate = PreviousVendorTransaction AND VendorAfterCutoff > 0)
)
SELECT DISTINCT
   TransactionID
   ,ProductID
   ,ReferenceOrderID
   ,ReferenceOrderLineID
   ,TransactionDate
   ,TransactionType
   ,Quantity
   ,ActualCost
   ,ModifiedDate
FROM
   Filtered

Edit:

Hmm, I think I may have to take back my comment on structuring it differently if there are indexes. The other suggestions that I have are probably fairly minor.

Make sure the query is using the indexes you're expecting it to. Start and End date to build temp table, end date to drive the previous event loop.
If the query to build the temp table is doing a lookup on the clustered index, it may be better to hold off and do that as part of the main query.
Try using a cte instead of a temp table. I think that a cte might be more competitive with the way that the query is structured below.
If you are returning a lot of events, it might be better to pull out the event table lookup to the main query to give the optimizer the option of doing a merge join.
I don't see a way of optimizing the previous event lookup short of an indexed view.

Here's a query that combines a few of those ideas.

SELECT
    e.[EventID]
INTO #EventTemp
FROM
    [Events] AS e
WHERE
    ( e.[EventStart] >= @StartDate AND e.[EventStart] <= @EndDate )
    OR ( e.[EventEnd] >= @StartDate AND e.[EventEnd] <= @EndDate )

;
WITH PrevEvent AS (
    SELECT
        EmpPrevEvent.[EventID]
    FROM
    (
        SELECT DISTINCT
            ee.[EmployeeID]
        FROM
            #EventTemp
            INNER JOIN [EventEmployee] AS ee ON
                #EventTemp.[EventID] = ee.[EventID]
    ) AS Emp
    CROSS APPLY (
        SELECT TOP 1
            e.[EventID]
        FROM
            [Events] AS e
            INNER JOIN [EventEmployee] AS ee ON
                e.[EventID] = ee.[EventID]
        WHERE
            ee.[EmployeeID] = Emp.[EmployeeID]
            AND e.[EventEnd] < @StartDate
        ORDER BY 
            e.[EventEnd] DESC
    ) AS EmpPrevEvent
)
SELECT
    e.[EventID],
    e.[EventStart],
    e.[EventEnd],
    e.[EventTypeID]
FROM
    [Events] AS e
WHERE
    e.EventID IN (
        SELECT EventID
        FROM #EventTemp
        UNION
        SELECT EventID
        FROM PrevEvent
    )

Best Answer

Related Solutions

Mysql – How to write this self join based on three columns

Sql-server – the best way to get all data for a date range, plus the last event just before the range

Related Question