SQL Subqueries – How to Identify Correlated Subqueries That Cannot Be Rewritten as JOINs Without Using DISTINCT

join;subquery

I am fairly new to SQL query tuning. I have been trying to understand how to write equivalent queries. While going through Prof. J. Widom's Stanford online video lectures, she makes mention that some subqueries that cannot be written by JOINs without the use of DISTINCT. For example, see this:

  1. 4:45 / 20:13 – GPA example

    SELECT GPA
    FROM Student
    WHERE sID in (select sID from Apply where major = 'CS');
    
  2. 6:39 / 20:13 – students applied to CS but not to EE

    SELECT sID, sName
    FROM Student
    WHERE sID IN (select sID from Apply where major = 'CS')
      AND sID NOT IN (select sID from Apply where major = 'EE');
    

My question is how can one know if an SQL statement written using subqueries will have an equivalent one written using joins. I am quite comfortable, if in the answers, someone likes to use relational algebra notation.

I searched quite a bit on the Net and could not find a suitable answer.

Sample Data

Schema and table creation (for PostgreSQL) are as follows,

CREATE TEMP TABLE college AS
SELECT cname::text, state::text, enrollment::int
FROM ( VALUES
  ('Stanford', 'CA', 15000),
  ('Berkeley', 'CA', 36000),
  ('MIT', 'MA', 10000),
  ('Cornell', 'NY', 21000)
) AS College(cname, state, enrollment);

CREATE TEMP TABLE student AS
SELECT sid::int, sname::text, gpa::real, sizehs::int
FROM ( VALUES
  (123, 'Amy', 3.9, 1000),
  (234, 'Bob', 3.6, 1500),
  (345, 'Craig', 3.5, 500),
  (456, 'Doris', 3.9, 1000),
  (567, 'Edward', 2.9, 2000),
  (678, 'Fay', 3.8, 200),
  (789, 'Gary', 3.4, 800),
  (987, 'Helen', 3.7, 800),
  (876, 'Irene', 3.9, 400),
  (765, 'Jay', 2.9, 1500),
  (654, 'Amy', 3.9, 1000),
  (543, 'Craig', 3.4, 2000)
) AS Student(sid, sname, gpa, sizehs);

CREATE TEMP TABLE apply AS
SELECT sid::int, cname::text, major::text, decision::text
FROM ( VALUES
  (123, 'Stanford', 'CS', 'Y'),
  (123, 'Stanford', 'EE', 'N'),
  (123, 'Berkeley', 'CS', 'Y'),
  (123, 'Cornell', 'EE', 'Y'),
  (234, 'Berkeley', 'biology', 'N'),
  (345, 'MIT', 'bioengineering', 'Y'),
  (345, 'Cornell', 'bioengineering', 'N'),
  (345, 'Cornell', 'CS', 'Y'),
  (345, 'Cornell', 'EE', 'N'),
  (678, 'Stanford', 'history', 'Y'),
  (987, 'Stanford', 'CS', 'Y'),
  (987, 'Berkeley', 'CS', 'Y'),
  (876, 'Stanford', 'CS', 'N'),
  (876, 'MIT', 'biology', 'Y'),
  (876, 'MIT', 'marine biology', 'N'),
  (765, 'Stanford', 'history', 'Y'),
  (765, 'Cornell', 'history', 'N'),
  (765, 'Cornell', 'psychology', 'Y'),
  (543, 'MIT', 'CS', 'N')
) AS apply(sid, cname, major, decision);

Best Answer

My question is how can one know if an SQL statement written using subqueries will have an equivalent one written using joins. I am quite comfortable, if in the answers, someone likes to use relational algebra notation.

Good question.

First Example Given

So let's look at the first query she gives..

SELECT GPA
FROM Student
WHERE sID in (SELECT sID from Apply where major = 'CS');

And, then let's look at a similar query with joins,

SELECT *
FROM student
JOIN apply ON (student.sid = apply.sid AND major = 'CS');

I've changed the columns in in the first and second intentionally. "GPA" here, is specifically student.GPA.

If the following, then the correlated subquery can not be rewritten as a simple join,

  1. the SELECT statement has aggregates WHERE gpa in (SELECT max(gpa)..) (if the join requires aggregation or grouping, you'll have to use a derived or virtual table: JOIN ( SELECT.. ) AS t.
  2. if the JOIN condition is NOT UNIQUE in the above, we assume there is one student with a matching row having both a common sid and major=CS. The resulset can grow or shrink of this assumption fails.

You can easily test that..

SELECT sid, count(*)
FROM apply
WHERE major='CS'
GROUP BY sid
HAVING count(*) > 1;

This returns two rows, both sids would get duped in the join.

 sid | count 
-----+-------
 123 |     2
 987 |     2

As an exercise in clarity, you could push the DISTINCT down yourself, and this can again be rewritten using a JOIN.

SELECT GPA
FROM student
JOIN (
  SELECT DISTINCT sid, major
  FROM apply
) AS apply
  ON (student.sid = apply.sid AND major = 'CS');

Or, even..

SELECT GPA
FROM student
JOIN (
  SELECT DISTINCT sid, major
  FROM apply
  WHERE major = 'CS'
) AS apply
  USING (sid);

Second Example Given

The second example,

SELECT sID, sName
FROM Student
WHERE sID IN (select sID from Apply where major = 'CS')
  AND sID NOT IN (select sID from Apply where major = 'EE');

Falls victim to the first example exactly. However, the NOT IN,

  • Doesn't matter for UNIQUE-ness because we're excluding results.
  • Moreover, rewriting the NOT IN as an anti-join (LEFT OUTER JOIN ... ON NULL) is likely faster.
  • It will handle any condition where apply.sid IS NULL more effectively, including the case where there are no major = 'EE'

Here is an example,

SELECT student.sID, sName
FROM Student
LEFT OUTER JOIN apply ON (student.sid = apply.sid AND major = 'EE')
WHERE student.sID IN (select sID from Apply where major = 'CS')
  AND apply.sid IS NULL;

In summary

  1. Joins (including INNER JOINS) can increase or decrease the amount of rows in the result set.
    1. If the JOIN condition is never true, the result on an INNER JOIN is pruned.
    2. If the result is true more than once for the same row, the result is that the amount of rows returned increases.
  2. But, joins join two tables. On t1 JOIN t2,
    1. if the t1, and t2 have no duplicate rows then SELECT t1.*, t2.* should have no dupes. Something should be different.
    2. If either t1 or t2 have duplicate rows then a JOIN that includes those row will result in duplicate output rows.
    3. If you SELECT a subset of t1, or a subset of t2 (as you did above when you ONLY selected GPA) then it's possible to have something that appears to be a dupe, though it's not. Had you have selected student.GPA, apply.* you would have seen the reason for the duping.