Sql-server – Correlated Subquery Versus Non-Correlated Subquery

sql serversubquery

There are three tables that may be used by a business to keep track of the absent days of its employees. It's from a collection of SQL puzzles by Joe Celko. The reader is tasked with concocting a DELETE statement to remove those employees who have accrued 40 or more absent days. And for that purpose we use a subquery. The one proposed by the author is this:

DELETE FROM Personnel
WHERE emp_id = (SELECT A1.emp_id
FROM Absenteeism AS A1
WHERE A1.emp_id = Personnel.emp_id
GROUP BY A1.emp_id
HAVING SUM(severity_points) >= 40);

And my question is, why go for a correlated one? Isn't it just more efficient to run something like one below?

DELETE FROM Personnel
WHERE emp_id =
(SELECT emp_id FROM Absenteeism
GROUP BY emp_id
HAVING SUM(severity_points) >= 40)

Thank you!

Best Answer

The non-correlated version you proposed will fail if the subquery returns more than one row (which it likely will). You should use IN rather than = here. The purpose of Celko's correlation is to make sure only zero or one rows is returned from the subquery (because it matches and groups on emp_id).

Also, don't forget to alias the table inside the subquery and explicitly reference the alias. Since both tables (Personnel and Absenteeism) have an emp_id column, you want to be very clear where you are referring to which table.

This is more of a "best practice" for readability, but can also prevent logical issues and mistakes in queries. For instance, this query would delete every row in the table:

DELETE FROM Personnel WHERE emp_id IN (SELECT emp_id);

Of course, this may seen like a contrived example, but it happens: sql server 2008 management studio not checking the syntax of my query

So here's the updated query:

DELETE FROM Personnel
WHERE emp_id IN
(SELECT a.emp_id FROM Absenteeism a
GROUP BY a.emp_id
HAVING SUM(a.severity_points) >= 40)

Given those changes: you're right that the correlation isn't strictly necessary in this case.

Logically, it seems like the correlated version might be more efficient, especially if there are lots and lots of rows in the Absenteeism table that don't have a match in the Personnel table (because the unbounded subquery might read more data).

But one would have to test to be sure. The query optimizer can take the declarative query you've written and do many different things with it.