SQL Subqueries – How to Identify Correlated Subqueries That Cannot Be Rewritten as JOINs Without Using DISTINCT

join;subquery

I am fairly new to SQL query tuning. I have been trying to understand how to write equivalent queries. While going through Prof. J. Widom's Stanford online video lectures, she makes mention that some subqueries that cannot be written by JOINs without the use of DISTINCT. For example, see this:

4:45 / 20:13 – GPA example

SELECT GPA
FROM Student
WHERE sID in (select sID from Apply where major = 'CS');

6:39 / 20:13 – students applied to CS but not to EE

SELECT sID, sName
FROM Student
WHERE sID IN (select sID from Apply where major = 'CS')
  AND sID NOT IN (select sID from Apply where major = 'EE');

My question is how can one know if an SQL statement written using subqueries will have an equivalent one written using joins. I am quite comfortable, if in the answers, someone likes to use relational algebra notation.

I searched quite a bit on the Net and could not find a suitable answer.

Sample Data

Schema and table creation (for PostgreSQL) are as follows,

CREATE TEMP TABLE college AS
SELECT cname::text, state::text, enrollment::int
FROM ( VALUES
  ('Stanford', 'CA', 15000),
  ('Berkeley', 'CA', 36000),
  ('MIT', 'MA', 10000),
  ('Cornell', 'NY', 21000)
) AS College(cname, state, enrollment);

CREATE TEMP TABLE student AS
SELECT sid::int, sname::text, gpa::real, sizehs::int
FROM ( VALUES
  (123, 'Amy', 3.9, 1000),
  (234, 'Bob', 3.6, 1500),
  (345, 'Craig', 3.5, 500),
  (456, 'Doris', 3.9, 1000),
  (567, 'Edward', 2.9, 2000),
  (678, 'Fay', 3.8, 200),
  (789, 'Gary', 3.4, 800),
  (987, 'Helen', 3.7, 800),
  (876, 'Irene', 3.9, 400),
  (765, 'Jay', 2.9, 1500),
  (654, 'Amy', 3.9, 1000),
  (543, 'Craig', 3.4, 2000)
) AS Student(sid, sname, gpa, sizehs);

CREATE TEMP TABLE apply AS
SELECT sid::int, cname::text, major::text, decision::text
FROM ( VALUES
  (123, 'Stanford', 'CS', 'Y'),
  (123, 'Stanford', 'EE', 'N'),
  (123, 'Berkeley', 'CS', 'Y'),
  (123, 'Cornell', 'EE', 'Y'),
  (234, 'Berkeley', 'biology', 'N'),
  (345, 'MIT', 'bioengineering', 'Y'),
  (345, 'Cornell', 'bioengineering', 'N'),
  (345, 'Cornell', 'CS', 'Y'),
  (345, 'Cornell', 'EE', 'N'),
  (678, 'Stanford', 'history', 'Y'),
  (987, 'Stanford', 'CS', 'Y'),
  (987, 'Berkeley', 'CS', 'Y'),
  (876, 'Stanford', 'CS', 'N'),
  (876, 'MIT', 'biology', 'Y'),
  (876, 'MIT', 'marine biology', 'N'),
  (765, 'Stanford', 'history', 'Y'),
  (765, 'Cornell', 'history', 'N'),
  (765, 'Cornell', 'psychology', 'Y'),
  (543, 'MIT', 'CS', 'N')
) AS apply(sid, cname, major, decision);

Best Answer

My question is how can one know if an SQL statement written using subqueries will have an equivalent one written using joins. I am quite comfortable, if in the answers, someone likes to use relational algebra notation.

Good question.

First Example Given

So let's look at the first query she gives..

SELECT GPA
FROM Student
WHERE sID in (SELECT sID from Apply where major = 'CS');

And, then let's look at a similar query with joins,

SELECT *
FROM student
JOIN apply ON (student.sid = apply.sid AND major = 'CS');

I've changed the columns in in the first and second intentionally. "GPA" here, is specifically student.GPA.

If the following, then the correlated subquery can not be rewritten as a simple join,

the SELECT statement has aggregates WHERE gpa in (SELECT max(gpa)..) (if the join requires aggregation or grouping, you'll have to use a derived or virtual table: JOIN ( SELECT.. ) AS t.
if the JOIN condition is NOT UNIQUE in the above, we assume there is one student with a matching row having both a common sid and major=CS. The resulset can grow or shrink of this assumption fails.

You can easily test that..

SELECT sid, count(*)
FROM apply
WHERE major='CS'
GROUP BY sid
HAVING count(*) > 1;

This returns two rows, both sids would get duped in the join.

 sid | count 
-----+-------
 123 |     2
 987 |     2

As an exercise in clarity, you could push the DISTINCT down yourself, and this can again be rewritten using a JOIN.

SELECT GPA
FROM student
JOIN (
  SELECT DISTINCT sid, major
  FROM apply
) AS apply
  ON (student.sid = apply.sid AND major = 'CS');

Or, even..

SELECT GPA
FROM student
JOIN (
  SELECT DISTINCT sid, major
  FROM apply
  WHERE major = 'CS'
) AS apply
  USING (sid);

Second Example Given

The second example,

SELECT sID, sName
FROM Student
WHERE sID IN (select sID from Apply where major = 'CS')
  AND sID NOT IN (select sID from Apply where major = 'EE');

Falls victim to the first example exactly. However, the NOT IN,

Doesn't matter for UNIQUE-ness because we're excluding results.
Moreover, rewriting the NOT IN as an anti-join (LEFT OUTER JOIN ... ON NULL) is likely faster.
It will handle any condition where apply.sid IS NULL more effectively, including the case where there are no major = 'EE'

Here is an example,

SELECT student.sID, sName
FROM Student
LEFT OUTER JOIN apply ON (student.sid = apply.sid AND major = 'EE')
WHERE student.sID IN (select sID from Apply where major = 'CS')
  AND apply.sid IS NULL;

In summary

Joins (including INNER JOINS) can increase or decrease the amount of rows in the result set.
1. If the JOIN condition is never true, the result on an INNER JOIN is pruned.
2. If the result is true more than once for the same row, the result is that the amount of rows returned increases.
But, joins join two tables. On t1 JOIN t2,
1. if the t1, and t2 have no duplicate rows then SELECT t1.*, t2.* should have no dupes. Something should be different.
2. If either t1 or t2 have duplicate rows then a JOIN that includes those row will result in duplicate output rows.
3. If you SELECT a subset of t1, or a subset of t2 (as you did above when you ONLY selected GPA) then it's possible to have something that appears to be a dupe, though it's not. Had you have selected student.GPA, apply.* you would have seen the reason for the duping.

Related Solutions

TheSQL query optimisation — multiple joins or select … where not in (select distinct…)

I would add an index on users_roles (rid, uid). In a many-to-many table with two columns (a,b), you almost always will need both indexes: (a,b) and (b,a) in one query or the other. I think this index would help in this query.
Try various rewritings of the query and the EXPLAIN EXTENDED they produce.
About your suggestions, the first is not correct (it will not show same results). For the second suggestion:

WHERE users.status = 1                                           -- Active users only

Yes, that's better than users.status <> 0. This change may have a better effect if there is an index on users (status) (and even more if there are not many active users). Optimizing queries with boolean columns (or ones that act as boolean) is not easy with B-trees.

  AND users.uid IN
     (SELECT DISTINCT uid FROM users_roles WHERE rid = 5)        -- Must be in rôle A

No. MySQL is known to have issues with column IN (SELECT ...), especially if the external table is big (and yours is 200K columns, so no, not good).

  AND users.uid NOT IN
     (SELECT DISTINCT uid FROM users_roles WHERE rid IN (6,8,9)) -- Not rôles B, C, D

Yes, that is one way to rewrite. The DISTINCT is redundant though.

  AND users.uid <> :users_uid                                    -- Not current user

Yes, removing the users.uid IS NOT NULL may help and does not change the result.

Other things you could try:

Moving the rid = 5 condition to the ON clause:

INNER JOIN users_roles users_roles 
  ON  users.uid = users_roles.uid
  AND users_roles.rid = 5

The (rewrite) to NOT IN can also be written with NOT EXISTS:

  AND NOT EXISTS 
      ( SELECT * 
        FROM users_roles ur 
        WHERE ur.uid = users.uid 
          AND ur.rid IN (6,8,9)
      )

MySQL: Very slow query, not using index, no joins involved

You may want the following indexes

ALTER TABLE tran.member ADD INDEX cardnbr_mbr_nbr_ndx (cardnbr,mbr_nbr);
ALTER TABLE tran.member ADD INDEX mbr_nbr_cardnbr_ndx (mbr_nbr,cardnbr);

Then, you could queries like these

Each Member and All Cards for That Member

SELECT mbr_nbr MemberNumber,GROUP_CONCAT(cardnbr) Cards
FROM tran.member GROUP BY mbr_nbr;

Each Card and All Members Associated with that Card

SELECT cardnbr CardNumber,GROUP_CONCAT(mbr_nbr) Members
FROM tran.member GROUP BY cardnbr;

So, that the size of tran.member does not slow things down, try creating a temp table for the purpose of gathering these two columns

DROP TABLE IF EXISTS tran.adhoc_members_cards;
CREATE TABLE tran.adhoc_members_cards
SELECT mbr_nbr,cardnbr FROM tran.member WHERE 1=2;
ALTER TABLE tran.adhoc_members_cards ENGINE=MyISAM;
ALTER TABLE tran.adhoc_members_cards ADD INDEX cardnbr_mbr_nbr_ndx (cardnbr,mbr_nbr);
ALTER TABLE tran.adhoc_members_cards ADD INDEX mbr_nbr_cardnbr_ndx (mbr_nbr,cardnbr);
ALTER TABLE tran.adhoc_members_cards DISABLE KEYS;
INSERT INTO tran.adhoc_members_cards
SELECT mbr_nbr,cardnbr FROM tran.memberl
SELECT mbr_nbr,cardnbr FROM tran.member WHERE 1=2;
ALTER TABLE tran.adhoc_members_cards ENABLE KEYS;

Now, you can run the two queries I mentioned like this:

SELECT mbr_nbr MemberNumber,GROUP_CONCAT(cardnbr) Cards
FROM tran.adhoc_members_cards GROUP BY mbr_nbr;
SELECT cardnbr CardNumber,GROUP_CONCAT(mbr_nbr) Members
FROM tran.adhoc_members_cards GROUP BY cardnbr;

Search for the cardmember you want as well:

SELECT cardnbr CardNumber,GROUP_CONCAT(mbr_nbr) Members
FROM tran.adhoc_members_cards WHERE cardnbr = 000099999930;