Sql-server – How to join at least two tables with different column names but same data, and only show the matching data

sql server

I am trying to combine two (possibly more) tables that has different column names but the same data within the columns I am trying to line up. For example, I have a table called dbo.member and within this table is a column called UID. On another table called dbo.tasklist_data, there is a column called TaskID which holds some of the same data as UID. What I am looking for is to combine the two tables with the result giving me ONLY the UID's on the member table, that matches up with the data in the TaskID column on the tasklist_data table.

I do hope I explained that well enough. Please bear with me as I am new to this but eager to learn. Any help or advice will be greatly appreciated. Thank you.

Best Answer

From a high level, you'd simply do this:

SELECT *
FROM dbo.member m
    INNER JOIN dbo.tasklist_data tld ON m.UID = tld.TaskID;

The INNER JOIN will take rows from dbo.member where the UID column values match values contained in the TaskID column from the tasklist_data table. If you wanted ALL the rows from dbo.member, and only matching rows from dbo.tasklist_data, you'd rewrite that as:

SELECT *
FROM dbo.member m
    LEFT JOIN dbo.tasklist_data tld ON m.UID = tld.TaskID;

Steve Stedman has an excellent resource for understanding how the different types of JOIN statements work.

If you need to do this on more than two tables, you simply add JOIN clauses as necessary:

SELECT *
FROM dbo.member m
    INNER JOIN dbo.tasklist_data tld ON m.UID = tld.TaskID
    INNER JOIN dbo.some_other_table sot ON m.UID = sot.some_column_name;

I have broken a (arguably) cardinal sin above by using SELECT * to return all columns from all of the tables mentioned in the FROM clause of the query. It is generally considered good practice to explicitly specify the columns required, for instance:

SELECT m.UID
    , m.SomeColumn1
    , m.SomeColumn2
    , tld.TaskID
    , tld.SomeColumn3
FROM dbo.member m
    INNER JOIN dbo.tasklist_data tld ON m.UID = tld.TaskID;

Related Solutions

Sql-server – How to speed up a query that orders by a calculated field

If you don't really need zero-second actuality, you could just run your query time to time and cache the results.

If you still need to have real-time data on this (sacrificing insert performance), I would do this:

Since self-joins are not allowed in indexed views, you need to create two copies of each table:

CREATE TABLE personBrother
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

CREATE TABLE personBrother2
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

Create an indexed view on their join:

CREATE VIEW
        commonBrothers
WITH SCHEMABINDING
AS
        SELECT  p1.personId AS p1,
                p2.personId AS p2,
                COUNT_BIG(*) AS cnt
        FROM    dbo.personBrother p1
        JOIN    dbo.personBrother2 p2
        ON      p1.brotherName = p2.brotherName
        WHERE   p1.personId < p2.personId
        GROUP BY
                p1.personId, p2.personId

CREATE UNIQUE CLUSTERED INDEX
        ux_commonBrothers_p1_p2
ON      commonBrothers (p1, p2)

CREATE INDEX
        ix_commonBrothers_cnt
ON      commonBrothers (cnt)

Same for sisters.

You should manually maintain these tables to have same data (write a trigger, insert/update/delete both etc).

Now we can easily get pairs with the most brothers and most sisters:

SELECT  TOP 1 WITH TIES
        *
FROM    commonBrothers
ORDER BY
        cnt DESC

All we need now is to fetch a greatest sum. Unfortunately, we cannot index a join of these views (it's a pure implementation flaw, there's no theoretical limitation for this).

So we need to do the following: the top pair cannot have less brothers than the top sis pair. Same holds for the sisters. So we have this query:

SELECT  TOP 1 WITH TIES
        cb.p1, cb.p2, cb.cnt + cs.cnt AS totalCnt
FROM    commonBrothers cb
JOIN    commonSisters cs
ON      cs.p1 = cb.p1
        AND cs.p2 = cb.p2
WHERE   cs.cnt >=
        (
        SELECT  MAX(cst.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonBrothers 
                ORDER BY
                        cnt DESC
                ) cbt
        JOIN    commonSisters cst
        ON      cst.p1 = cbt.p1
                AND cst.p2 = cbt.p2
        )
        AND cb.cnt >=
        (
        SELECT  MAX(cbt.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonSisters
                ORDER BY
                        cnt DESC
                ) cst
        JOIN    commonBrothers cbt
        ON      cbt.p1 = cst.p1
                AND cbt.p2 = cst.p2
        )
ORDER BY
        totalCnt DESC

If the numbers of common brothers and sisters are correlated, this query will be very fast.

This solution has two drawbacks:

DML performance: if you insert or delete a record for a name shared by million brothers, the indexed view will get 2M inserts or delete. This is the price you pay for real-time query: the kind of data you are asking for cannot be easily indexed.
Persons with 0 brothers or 0 sisters will not be indexed. If there's a chance that top pair will not have brothers or sisters, you should amend the last query a little.

Sql-server – Listing indexes and constraints

These two queries may help you. The first will list all of the tables and indexes on those tables in your database. If the table does not appear in the list is does not have any indexes defined on it. These queries assume SQL Server version 2005 or newer.

SELECT 
    IndexName = QUOTENAME(I.name), 
    TableName =
        QUOTENAME(SCHEMA_NAME(T.[schema_id])) + 
        N'.' + QUOTENAME(T.name), 
    IsPrimaryKey = I.is_primary_key
FROM sys.indexes AS I
INNER JOIN sys.tables AS T
    ON I.[object_id] = T.[object_id]
WHERE
    I.type_desc <> N'HEAP'
ORDER BY 
    TableName ASC, 
    IndexName ASC;

The second query will report for each table the identity column, if any on each table in your database.

SELECT
    TableName =
        QUOTENAME(SCHEMA_NAME(T.[schema_id])) + 
        N'.' + QUOTENAME(T.name), 
    IdentityColumn = COALESCE(QUOTENAME(C.name), N'No identity column')
FROM sys.tables AS T
LEFT OUTER JOIN sys.columns AS C
    ON T.[object_id] = C.[object_id]
    AND C.is_identity = 1
ORDER BY
    TableName ASC;

To limit the queries to a specific table add a WHERE clause similar to:

WHERE T.name = N'NAME-OF-TABLE'

Best Answer

Related Solutions

Sql-server – How to speed up a query that orders by a calculated field

Sql-server – Listing indexes and constraints

Related Question