Sql-server – Comparing two pairs of columns between two tables and returning the column from a third table

sql-server-2008

I have three tables:

Table1

table1 can have anywhere from 50K to 800k unique rows. By this, I mean that the combination of all the values in each row are unique, although sometimes, some columns will match up. It’s POSSIBLE that an entire row will match except the NAME column, but it’s very, very unlikely. The NAME column will always be unique.

The NAME column is type varchar(20). The remainder of the columns for each record are all type varchar(6), where every two columns is a set pair and the table has up to 21 set pairs (ie 43 columns total). Here is an example table1 showing one set pair for 4 records (I used single characters here for ease but don’t forget they are type varchar(6)):

table1

NAME        pair1_1     pair1_2     ...up to pair21_1 - pair21_2
00001A      A           B       
00002A      A           A       
00003A      B           C       
00004A      A           B       
…up to 800k rows

Table2

table2 is set up exactly the same way except it contains completely different records of the same exact type (here there can be anywhere from 1 to 200 rows)

table2

NAME        pair1_1    pair1_2    ...up to pair21_1 - pair21_2
1234B       A          B
5678B       A          A
9101B       C          C
1213B       A          B        
…up to 200 rows

Table3

table3 is associated with a single row in table2 and can be characterized as every possible result of a comparison between that single row in table2 to any possible row that COULD BE in table1. It’s probably best to call it by the NAME in table2 pertaining to it (let’s use just the first and call it table1234B because if there are 200 rows in table2, there are 200 different associated table3s).

This 3rd table will contain 4 rows. It has a NAME column which is a varchar(20) and 21 sets of 7 columns (each associated with a different pair from table1 and table2). The additional columns res1_1, res1_2…res21_7 are type decimal(30,7).

Here’s what it looks like:

table1234B

NAME        res1_1  res1_2  res1_3  res1_4  res1_5  res1_6  res1_7  ….res21_7
1234B       12.30   1.000   0.2500  1.000   2.000   2.10    25.00   ….

I want to use the combination of shared data in the pairs of columns from table1 and table2 (ie how they match up) as the condition to select data from table1234B (I will show some of this below). Only coinciding pairs will be compared. pair1_1 and pair1_2 from table1 will be compared to pair1_1 and pair1_2 from table2, the pair2s will only be compared with pair2s, pair3s only with pair3s etc. So, no cross pair comparisons are made (eg pair1 will never be compared with pair2)

In this example below, the fields in pair1_1 from each table match (A), and the fields in pair1_2 from each table match (B) but the fields don’t match between the columns of each table.

NAME        Pair1_1     Pair1_2
00001A      A           B       (from table1)
1234B       A           B       (from table2)

So I want to SELECT say pair1_4 from table1234B and associate it with the comparison between records 00001A and 1234B.

If the tables were like this we can see that all 4 of the fields match.

NAME        pair1_1     pair1_2
00001A      A           A       (from table1)
1234B       A           A       (from table2)

In this instance maybe I would want to choose pair1_1 from table1234B

Here we can see that pair1_1 in table1 matches both fields from table2 but pair1_2 from table1 does not match anything.

NAME        pair1_1     pair1_2
00001A      A           B       (from table1)
1234B       A           A       (from table2)

So I want to choose say pair1_4 from table1234B

The above are only 3 of the 14 possible ways the data in the column pairs can be shared between the tables (there are multiple ways they share no data between tables also) but only 7 possible columns for each pair to select from table1234B.

I want to select ALL of the values from table1234B that meet the criteria set forth from all of the possible sharing between the columns in each set of 2 from table1 and table2. It will look like something like this when complete:

1234BResult

NAME        RESULTPair1     …up to Resultpair21
00001A      12.30       (res1_1 from table3)        
00002B      1.000       (res1_2 from table3)                
00003C      25.00       (res1_7 from table3)        
00004A      1.000       (res1_4 from table3)
…up to 800K rows

Here is my edited query I’m starting with.

SELECT t1.NAME as NAME, t3.pair1_4 as RESULTPair1
FROM table1 t1
LEFT OUTER JOIN table2 t2
ON (t1.pair1_1 = t2.pair1_1 AND t1.pair1_1 <> t1.pair1_2 AND t1.pair1_2 = t2.pair1_2 AND t2.pair1_1 <> t2.pair1_2)
LEFT OUTER JOIN table1234B t3
ON t2.NAME = t3.NAME

…and the result of that query

NAME        RESULTPair1 
00001A      1.000       (res1_4 from table3)
00002B      NULL
00003A      NULL
00004A      1.000       (res1_4 from table3)

This is closer to what I am looking for as it leaves the NAMEs from table2 with different sharing open to other values from table1234B. I intend the remaining logic once incorporated would return the other appropriate results.

The trouble is that this only gives me the result of any situation that is AB, AB. I need to expand this to 21 pairs and potentially 200 table3s, including all of the different possible results from table3 if applicable. I can handle all the sharing logic (ie pair1_1 = pairt2_2 or pair1_1 <> pair2_2) it’s expanding it to the remaining situations and potentially more records from table2 that I am at a loss on. I need to get results for all 800k records in table1 for all 21 pairs.

If you’ve stayed with me this long and understand all that chicken scratch, my questions are:

  1. How would I edit my above query to work with a JOIN instead of a WHERE?
  2. Would there be a way to expand this to efficiently include the logic of the additional possible scenarios of sharing between the fields in table1 and table2 so that I can see or store in a new table all 50-800k results from the table1 to single row in table2 comparison?
  3. How the heck do I expand it to the additional 20 pairs?

EDIT
After being asked about the connection with table3 I realized that the column needed an edit for this to work properly. Im still having trouble figuring out how to make the query work right. I have edited the query too. Im still having a little trouble expanding it to multiple columns. It was suggested to use INTERSECT + EXCEPT or NOT EXISTS. I couldn't get INTERSECT to return anything whereas the query with the edit to include the left joins does.

Best Answer

I've got a working query. It joins 21 derived tables (one for each of the 21 pairs compared). They are joined with 20 FULL OUTER JOINs rather than LEFT JOINs, each with sets of 14 subqueries combined with UNION ALLs so I'll just give a general form with some comments. It does 67,000 comparisons in 5 seconds flat.

SELECT COALESCE (t4.NAME, t5.NAME,...t25.NAME) AS NAME,
                t4.RESULTPair1, t5.RESULTPair2,...t25.RESULTPair21           

FROM ((SELECT t1.NAME AS NAME, t3.RESULTPair1_1 AS RESULTPair1
     FROM table1 t1
     INNER JOIN table2 t2
     ON (t1.pair1_1 = t1.pair1_2 AND    ---logic between table1 and table2 for AAAA
         t1.pair1_1 = t2.pair1_1 AND
         t1.pair1_1 = t2.pair1_2)
     INNER JOIN table1234B t3
     ON t2.NAME = t3.NAME) 

     UNION ALL

     (SELECT t1.NAME AS NAME, t3.RESULTPair1_2 AS RESULTPair1
     FROM table1 t1
     INNER JOIN table2 t2
     ON (---LOGIC FOR AAAB---)  ---I wont bore you with the rest of the matching logic unless someone is really interested
     INNER JOIN table1234B t3
     ON t2.NAME = t3.NAME) 

     ------Repeat the above query for each set of logic AAAB, AABA, ABAA, BAAA etc. with 13
     ------total UNION ALL (each set of logics will take care of every possible situation of
     ------sharing between the tables and each t1.NAME is different so UNION ALL will be more
     ------efficient, COALESCE for the NAME field returns a full column for NAME)

    UNION ALL

    (SELECT t1.NAME AS NAME, t3.RESULTPair1_7 AS RESULTPair1
    FROM OFFENDERS t1
    INNER JOIN table2 t2
    ON  (---LOGIC FOR AABB---)
    INNER JOIN table1234B t3
    ON t2.NAME = t3.NAME)) t4 --aabb

     ——-now move through the remaining result pairs up to 21 (ie derived table25)
FULL OUTER JOIN
     ——-subqueries for result set2
ON t4.NAME = t5.NAME
     ——-subqueries for resultsets 3-20
ON t23.NAME = t24.NAME
FULL OUTER JOIN

    ((SELECT t1.NAME AS NAME, t3.RESULTPair21_1 AS RESULTPair21
    FROM table1 t1
    INNER JOIN table2 t2
    ON  (---LOGIC FOR AAAA---)
    INNER JOIN table1234B t3
    ON t2.NAME = t3.NAME)  ---aaaa

    UNION ALL

    -------remaining sets of logic for RESULTPair21

    SELECT t1.NAME AS NAME, t3.RESULTPair21_7 AS RESULTPair21
    FROM table1 t1
    INNER JOIN table2 t2
    ON  (---LOGIC FOR AABB---)
    INNER JOIN table3 t3
    ON t2.NAME = t3.NAME)) t25 --aabb

ON t24.NAME = t25.NAME 

It's a work in progress so if there's something I can do to improve it, let me know! Like I said in the above comments, I think I'm going to set up another table to index the varchar(6) fields to integers then use the integers for comparison. So now table1 and table2 will look like this:

table1
NAME        pair1_1     pair1_2     ...up to pair21_1 - pair21_2
00001A      1           2       
00002A      1           1       
00003A      2           3       
00004A      1           2       
…up to 800k rows 

table2
NAME        pair1_1    pair1_2    ...up to pair21_1 - pair21_2
1234B       1          2
5678B       1          1
9101B       3          3
1213B       1          2        
…up to 200 rows

...and the additional index to make comparison faster and in case I ever want to retrieve the actual sets of data:

intIndex
pairs    code
A        1
B        2
C        3
D        4
...
up to all possible varchar(6) character strings

I have added the index table and now have comparisons of integers instead of the varchar(6) columns. 67,000 table1 row query produces 21 columns in 5 seconds for the varchar(6) columns vs 4 seconds for the integer columns. 1.4 million calculations. Anyone got any other suggestions that might help me out?