How and what is the most efficient way to join two tables, retaining a particular field from both

join;performance

I have two tables, with the following columns:

Table 1

  • ID
  • Component
  • Data1
  • Data2

Table 2

  • ID
  • Component
  • Data3
  • Data4

If I have rows in these tables, with items Component1, Component2, Component3, as entries in the component field for table 1, and Component3, Component4, and Component5 in the component field for table 2, I am looking to query such that I can get all results like:

Component1, data1, data2, data3, data4
Component2, data1, data2, data3, data4
Component3, data1, data2, data3, data4
Component4, data1, data2, data3, data4
Component5, data1, data2, data3, data4

Note that there are multiple entries of Component, and so I will need to group by the field to get aggregate summary data (using count, avg, etc. on the data fields). I am using SQL Server 2012 but need a solution that will work on 2005 ideally, a portable solution that works on other RDBMS would be preferable, but not necessary right now.

My question is two fold:

1) How do I perform this query? Inner, outer, and full joins seem useless, as they only let me retain Component from either one table or the other, not both. I'm unsure if a Cartesian product is the solution, I suspect it may be but I can't quite see how.

2) What is the most performant way of handling this, given that the number of entries in both tables will potentially be massive? Currently I have two SPs that get the summary data for the tables individually. As the SPs return only a handful of rows – maybe 20 to 30 in practise – because they summarise the data, is it more efficient/possible to perform the solution to part 1) above against the SPs rather than as a single query? An alternative I have is to simply merge the results of the SPs in code, I would rather do this on the database but efficiency will be quite important.

Any help appreciated- and I apologise if this question has been asked, it seems like it should be an obvious and common problem, but I couldn't find any help on Google et. al. though it could be that I was just using the wrong search terms!

Best Answer

It's not entirely clear what you are after but I think a FULL outer join would help:

SELECT 
    COALESCE(a.Component, b.Component) AS Component
  , COALESCE(a.data1, 0) AS data1
  , COALESCE(a.data2, 0) AS data2
  , COALESCE(b.data3, 0) AS data3
  , COALESCE(b.data4, 0) AS data4
FROM 
    table1 AS a
  FULL JOIN
    table2 AS b
      ON b.Component = a.Component ;

If Component is not UNIQUE on one (or both) of the tables, then you could aggreagte and then join:

WITH a AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data1) AS sum_data1
      , SUM(data2) AS sum_data2
      -- ...
      , AVG(data1) AS avg_data1
      -- ...
    GROUP BY
        Component
    FROM
        table1
  )
  , b AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data3) AS sum_data3
      -- ...
      , AVG(data3) AS avg_data3
      -- ...
    GROUP BY
        Component
    FROM
        table2
  ) 
    SELECT 
        COALESCE(a.Component, b.Component) AS Component
      , COALESCE(a.sum_data1,0) AS sum_data1
      -- ...
      , COALESCE(b.sum_data3,0) AS sum_data3
      -- ...
    FROM 
        a FULL JOIN b
            ON b.Component = a.Component ;