How and what is the most efficient way to join two tables, retaining a particular field from both

join;performance

I have two tables, with the following columns:

Table 1

ID
Component
Data1
Data2

Table 2

ID
Component
Data3
Data4

If I have rows in these tables, with items Component1, Component2, Component3, as entries in the component field for table 1, and Component3, Component4, and Component5 in the component field for table 2, I am looking to query such that I can get all results like:

Component1, data1, data2, data3, data4
Component2, data1, data2, data3, data4
Component3, data1, data2, data3, data4
Component4, data1, data2, data3, data4
Component5, data1, data2, data3, data4

Note that there are multiple entries of Component, and so I will need to group by the field to get aggregate summary data (using count, avg, etc. on the data fields). I am using SQL Server 2012 but need a solution that will work on 2005 ideally, a portable solution that works on other RDBMS would be preferable, but not necessary right now.

My question is two fold:

1) How do I perform this query? Inner, outer, and full joins seem useless, as they only let me retain Component from either one table or the other, not both. I'm unsure if a Cartesian product is the solution, I suspect it may be but I can't quite see how.

2) What is the most performant way of handling this, given that the number of entries in both tables will potentially be massive? Currently I have two SPs that get the summary data for the tables individually. As the SPs return only a handful of rows – maybe 20 to 30 in practise – because they summarise the data, is it more efficient/possible to perform the solution to part 1) above against the SPs rather than as a single query? An alternative I have is to simply merge the results of the SPs in code, I would rather do this on the database but efficiency will be quite important.

Any help appreciated- and I apologise if this question has been asked, it seems like it should be an obvious and common problem, but I couldn't find any help on Google et. al. though it could be that I was just using the wrong search terms!

Best Answer

It's not entirely clear what you are after but I think a FULL outer join would help:

SELECT 
    COALESCE(a.Component, b.Component) AS Component
  , COALESCE(a.data1, 0) AS data1
  , COALESCE(a.data2, 0) AS data2
  , COALESCE(b.data3, 0) AS data3
  , COALESCE(b.data4, 0) AS data4
FROM 
    table1 AS a
  FULL JOIN
    table2 AS b
      ON b.Component = a.Component ;

If Component is not UNIQUE on one (or both) of the tables, then you could aggreagte and then join:

WITH a AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data1) AS sum_data1
      , SUM(data2) AS sum_data2
      -- ...
      , AVG(data1) AS avg_data1
      -- ...
    GROUP BY
        Component
    FROM
        table1
  )
  , b AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data3) AS sum_data3
      -- ...
      , AVG(data3) AS avg_data3
      -- ...
    GROUP BY
        Component
    FROM
        table2
  ) 
    SELECT 
        COALESCE(a.Component, b.Component) AS Component
      , COALESCE(a.sum_data1,0) AS sum_data1
      -- ...
      , COALESCE(b.sum_data3,0) AS sum_data3
      -- ...
    FROM 
        a FULL JOIN b
            ON b.Component = a.Component ;

Related Solutions

Join the two tables which possess one to many relationship

The relationship appears to be between content_field_mem_id.field_mem_id_value and ex.mem_id However, since content_field_mem_id.field_mem_id_value is an int and ex.mem_id is a varchar, you will need to cast/convert one of the columns as part of your join.

It would be more reliable (although slower) to convert the int column to a varchar.

You didn't say which DB technology you are using, so I can't show the exact syntax for the convert (it is very different between MS SQL/Oracle/DB2/etc).

MySQL JOIN two tables and get latest result

I see it as 2 steps:

Build tables with just the latest signal (or noise) for each device
JOIN or UNION the two tables.

Step 1 is a variant of groupwise max:

SELECT  device_id, stats_time, status, noise -- The desired columns
    FROM  ( SELECT  @prev := '' ) init
    JOIN  ( SELECT
                device_id != @prev AS first, -- `device_id` is the 'GROUP BY'
                @prev := device_id,          -- the 'GROUP BY'
                device_id, stats_time, status, noise -- Also the desired columns
            FROM  TableA -- The table
            ORDER BY  device_id  DESC, -- The 'GROUP BY'
                      stats_time DESC  -- to get latest
      ) x
    WHERE  first;

This may be beneficial to performance:

INDEX(device_id, stats_time)

Ditto for TableB and signal. Manually run them to see if I got them right.

Your example does not show a case where both signal and noise exist for the same device_id. I will assume that is really the case, hence UNION:

Step 2:

SELECT device_id, stats_time, status, signal, noise
    FROM
    ( SELECT device_id, stats_time, status, signal, '' AS noise
        ... (the rest of the signal query)
    )
    UNION ALL
    ( SELECT device_id, stats_time, status, '' AS signal, noise
        ... (the rest of the noise query)
    );

Best Answer

Related Solutions

Join the two tables which possess one to many relationship

MySQL JOIN two tables and get latest result

Related Question