I am fairly new to SQL query tuning. I have been trying to understand how to write equivalent queries. While going through Prof. J. Widom's Stanford online video lectures, she makes mention that some subqueries that cannot be written by JOIN
s without the use of DISTINCT
. For example, see this:
-
4:45 / 20:13 – GPA example
SELECT GPA FROM Student WHERE sID in (select sID from Apply where major = 'CS');
-
6:39 / 20:13 – students applied to CS but not to EE
SELECT sID, sName FROM Student WHERE sID IN (select sID from Apply where major = 'CS') AND sID NOT IN (select sID from Apply where major = 'EE');
My question is how can one know if an SQL statement written using subqueries will have an equivalent one written using joins. I am quite comfortable, if in the answers, someone likes to use relational algebra notation.
I searched quite a bit on the Net and could not find a suitable answer.
Sample Data
Schema and table creation (for PostgreSQL) are as follows,
CREATE TEMP TABLE college AS
SELECT cname::text, state::text, enrollment::int
FROM ( VALUES
('Stanford', 'CA', 15000),
('Berkeley', 'CA', 36000),
('MIT', 'MA', 10000),
('Cornell', 'NY', 21000)
) AS College(cname, state, enrollment);
CREATE TEMP TABLE student AS
SELECT sid::int, sname::text, gpa::real, sizehs::int
FROM ( VALUES
(123, 'Amy', 3.9, 1000),
(234, 'Bob', 3.6, 1500),
(345, 'Craig', 3.5, 500),
(456, 'Doris', 3.9, 1000),
(567, 'Edward', 2.9, 2000),
(678, 'Fay', 3.8, 200),
(789, 'Gary', 3.4, 800),
(987, 'Helen', 3.7, 800),
(876, 'Irene', 3.9, 400),
(765, 'Jay', 2.9, 1500),
(654, 'Amy', 3.9, 1000),
(543, 'Craig', 3.4, 2000)
) AS Student(sid, sname, gpa, sizehs);
CREATE TEMP TABLE apply AS
SELECT sid::int, cname::text, major::text, decision::text
FROM ( VALUES
(123, 'Stanford', 'CS', 'Y'),
(123, 'Stanford', 'EE', 'N'),
(123, 'Berkeley', 'CS', 'Y'),
(123, 'Cornell', 'EE', 'Y'),
(234, 'Berkeley', 'biology', 'N'),
(345, 'MIT', 'bioengineering', 'Y'),
(345, 'Cornell', 'bioengineering', 'N'),
(345, 'Cornell', 'CS', 'Y'),
(345, 'Cornell', 'EE', 'N'),
(678, 'Stanford', 'history', 'Y'),
(987, 'Stanford', 'CS', 'Y'),
(987, 'Berkeley', 'CS', 'Y'),
(876, 'Stanford', 'CS', 'N'),
(876, 'MIT', 'biology', 'Y'),
(876, 'MIT', 'marine biology', 'N'),
(765, 'Stanford', 'history', 'Y'),
(765, 'Cornell', 'history', 'N'),
(765, 'Cornell', 'psychology', 'Y'),
(543, 'MIT', 'CS', 'N')
) AS apply(sid, cname, major, decision);
Best Answer
Good question.
First Example Given
So let's look at the first query she gives..
And, then let's look at a similar query with joins,
I've changed the columns in in the first and second intentionally. "GPA" here, is specifically
student.GPA
.If the following, then the correlated subquery can not be rewritten as a simple
join
,SELECT
statement has aggregatesWHERE gpa in (SELECT max(gpa)..)
(if the join requires aggregation or grouping, you'll have to use a derived or virtual table:JOIN ( SELECT.. ) AS t
.JOIN
condition is NOT UNIQUE in the above, we assume there is onestudent
with a matching row having both a commonsid
andmajor=CS
. The resulset can grow or shrink of this assumption fails.You can easily test that..
This returns two rows, both
sid
s would get duped in the join.As an exercise in clarity, you could push the
DISTINCT
down yourself, and this can again be rewritten using aJOIN
.Or, even..
Second Example Given
The second example,
Falls victim to the first example exactly. However, the
NOT IN
,UNIQUE
-ness because we're excluding results.NOT IN
as an anti-join (LEFT OUTER JOIN ... ON NULL
) is likely faster.major = 'EE'
Here is an example,
In summary
INNER JOINS
) can increase or decrease the amount of rows in the result set.JOIN
condition is never true, the result on anINNER JOIN
is pruned.t1 JOIN t2
,t1
, andt2
have no duplicate rows thenSELECT t1.*, t2.*
should have no dupes. Something should be different.t1
ort2
have duplicate rows then aJOIN
that includes those row will result in duplicate output rows.SELECT
a subset oft1
, or a subset oft2
(as you did above when you ONLY selected GPA) then it's possible to have something that appears to be a dupe, though it's not. Had you have selectedstudent.GPA, apply.*
you would have seen the reason for the duping.