Sql-server – How to query based on related table without returning records from related table

sql server

Table_A

^ID^Name^
|1 |Joe |
|2 |Jane|

Table_B

^ID^Foreign_ID^Company  ^Specialty^
|1 |1         |Microsoft|SQL      |
|2 |1         |Microsoft|PHP      |
|3 |2         |Oracle   |SQL      |
|4 |2         |Oracle   |HTML5    |

The query:

SELECT Name, Company
FROM   Table_A
         LEFT JOIN Table_B
         ON Table_A.ID = Table_B.Foreign_ID
WHERE  Table_B = 'Microsoft'

Say I'm formatting a report, and I'm only interested in employee's that work at Microsoft and I only want 1 record per employee, how do I prevent the Speciality column in Table_B from creating multiple rows?

Best Answer

You don't need a join, since you have no columns from table_B in the output. You could use a semi-join, i.e an EXISTS subquery:

SELECT a.Name, 'Microsoft' AS Company
FROM Table_A AS a
WHERE EXISTS 
      ( SELECT *
        FROM Table_B AS b
        WHERE a.ID = b.Foreign_ID
          AND b.Company = 'Microsoft'
      ) ;

If you did want some other column from table_B (say the Specialty) and still only 1 row, you could use either a derived table with a window function, or CROSS APPLY with TOP 1. CROSS APPLY is a special join where we can reference previous tables. Note that it's similar to INNER JOIN (you had LEFT JOIN while you need INNER JOIN):

SELECT a.Name, 'Microsoft' AS Company, b.Specialty
FROM Table_A AS a
  CROSS APPLY 
    ( SELECT TOP (1) bi.Specialty 
      FROM Table_B AS bi
      WHERE a.ID = bi.Foreign_ID
        AND b.Company = 'Microsoft'
      ORDER BY Specialty 
    ) AS b ;

Related Solutions

Sql-server – How to speed up a query that orders by a calculated field

If you don't really need zero-second actuality, you could just run your query time to time and cache the results.

If you still need to have real-time data on this (sacrificing insert performance), I would do this:

Since self-joins are not allowed in indexed views, you need to create two copies of each table:

CREATE TABLE personBrother
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

CREATE TABLE personBrother2
        (
        personId INT NOT NULL,
        brotherName INT NOT NULL
        )

Create an indexed view on their join:

CREATE VIEW
        commonBrothers
WITH SCHEMABINDING
AS
        SELECT  p1.personId AS p1,
                p2.personId AS p2,
                COUNT_BIG(*) AS cnt
        FROM    dbo.personBrother p1
        JOIN    dbo.personBrother2 p2
        ON      p1.brotherName = p2.brotherName
        WHERE   p1.personId < p2.personId
        GROUP BY
                p1.personId, p2.personId

CREATE UNIQUE CLUSTERED INDEX
        ux_commonBrothers_p1_p2
ON      commonBrothers (p1, p2)

CREATE INDEX
        ix_commonBrothers_cnt
ON      commonBrothers (cnt)

Same for sisters.

You should manually maintain these tables to have same data (write a trigger, insert/update/delete both etc).

Now we can easily get pairs with the most brothers and most sisters:

SELECT  TOP 1 WITH TIES
        *
FROM    commonBrothers
ORDER BY
        cnt DESC

All we need now is to fetch a greatest sum. Unfortunately, we cannot index a join of these views (it's a pure implementation flaw, there's no theoretical limitation for this).

So we need to do the following: the top pair cannot have less brothers than the top sis pair. Same holds for the sisters. So we have this query:

SELECT  TOP 1 WITH TIES
        cb.p1, cb.p2, cb.cnt + cs.cnt AS totalCnt
FROM    commonBrothers cb
JOIN    commonSisters cs
ON      cs.p1 = cb.p1
        AND cs.p2 = cb.p2
WHERE   cs.cnt >=
        (
        SELECT  MAX(cst.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonBrothers 
                ORDER BY
                        cnt DESC
                ) cbt
        JOIN    commonSisters cst
        ON      cst.p1 = cbt.p1
                AND cst.p2 = cbt.p2
        )
        AND cb.cnt >=
        (
        SELECT  MAX(cbt.cnt)
        FROM    (
                SELECT  TOP 1 WITH TIES
                        p1, p2
                FROM    commonSisters
                ORDER BY
                        cnt DESC
                ) cst
        JOIN    commonBrothers cbt
        ON      cbt.p1 = cst.p1
                AND cbt.p2 = cst.p2
        )
ORDER BY
        totalCnt DESC

If the numbers of common brothers and sisters are correlated, this query will be very fast.

This solution has two drawbacks:

DML performance: if you insert or delete a record for a name shared by million brothers, the indexed view will get 2M inserts or delete. This is the price you pay for real-time query: the kind of data you are asking for cannot be easily indexed.
Persons with 0 brothers or 0 sisters will not be indexed. If there's a chance that top pair will not have brothers or sisters, you should amend the last query a little.

SQL Server Database Design – Flexible vs. Fixed Tables

I'd suggest you have a think about this system's use cases. Work out in your own mind how the queries will look for the given design and for your alternative. Include scenarios where people move between slots. What do updates look like? Can business constraints be enforced in DRI? Can the desired values still be found? Having actual examples to talk through will highlight your concerns and allow your boss to explain the model more fully.

Writing horror queries against a ghastly schema is only one part of the problem. Someone has to fix them, at 3am, with no documentation, when they break. More intelligible queries at this stage will give you a better system, sooner.

If it calms your nerves any, I've had success with EAV systems. That said, we were a team of very experienced DBAs, it really was a blue-sky, "what if" project and, no, we didn't bet the company on it. Any dev project that will be heading to production we discourage most strongly from using an EAV approach.

If you go the 'flexible' route do not try to write generic queries just because you have a generic schema. Each query should only deal with one of Company, Location, Region or whatever your slot types are.

My two cents' worth.

Best Answer

Related Solutions

Sql-server – How to speed up a query that orders by a calculated field

SQL Server Database Design – Flexible vs. Fixed Tables

Related Question