How Does a Subquery Use Main Query Columns

subquery, where

I'm really confused by this SQL. It's used in our app, but I have no idea how it works. I adapted it from examples I found online, and it seems to work without issues.

It grabs the row with the latest timestamp in a set. We couldn't use ORDER BY because there had to be multiple groupings. The actual query is much larger; this is a cut-down version showing the fundamental selection.

SELECT * FROM table1 AS main
WHERE NOT EXISTS (SELECT * FROM table1 sub
                  WHERE sub.datetime > main.datetime);

(Live online version: http://sqlfiddle.com/#!17/290cc/2)

What I think should be happening:

Two identical tables are selected, and the constraint in the subquery is saying:

Only select the rows from the subquery which have a date greater than any other row in the main query.

In this case:

Subquery id:1 has a date that is smaller than or equal to the date of every row in the main query, so it doesn't get selected.
Subquery id:2 has a date which is larger than id:1 from the main query selection, so id:2 gets selected once. It is not larger than anything else.
Subquery id:3 has a date which is larger than both id:1 and id:2 from the main query selection, so it gets selected twice.

So the return from the subquery should be (2, 3, 3). The main query should then select whatever does not exist in that set, which should be id:1, but it returns id:3.

Where is my misunderstanding?

Best Answer

Well, there are some misunderstandings about what the subquery is doing.

First of all, EXISTS works in this case by evaluating the subquery for every row of the main query, and returns true if the subquery returns even a single row.

Since you are using WHERE NOT EXISTS(....), what's happening for every row from the main query is the following:

For id = 1: The subquery looks for rows in the same table whose date is greater than the date for id = 1. It finds ids 2 and 3, so EXISTS evaluates to true and NOT EXISTS eliminates this row.
For id = 2: Same as before. In this case the subquery returns id 3, so EXISTS evaluates to true and NOT EXISTS eliminates this row.
For id = 3: There are no other rows in this table with a date greater than it, so EXISTS evaluates to false and NOT EXISTS returns true, so you get this row as a result.
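The row-by-row evaluation above can be checked with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for the database in the fiddle, the three rows are made up (the fiddle's data isn't shown in the post), and the names later_rows and kept are just for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real database; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, datetime TEXT)")
conn.executemany(
    "INSERT INTO table1 (id, datetime) VALUES (?, ?)",
    [
        (1, "2020-01-01 10:00:00"),
        (2, "2020-01-02 10:00:00"),
        (3, "2020-01-03 10:00:00"),
    ],
)

# Run the subquery by hand once per outer row, substituting that row's
# datetime for main.datetime, to see what EXISTS would be looking at.
later_rows = {}
outer_rows = conn.execute("SELECT id, datetime FROM table1 ORDER BY id").fetchall()
for main_id, main_dt in outer_rows:
    subs = conn.execute(
        "SELECT id FROM table1 sub WHERE sub.datetime > ? ORDER BY id",
        (main_dt,),
    ).fetchall()
    later_rows[main_id] = [row[0] for row in subs]

# The query from the question: keep only outer rows whose subquery is empty.
kept = [
    row[0]
    for row in conn.execute(
        """
        SELECT main.id FROM table1 AS main
        WHERE NOT EXISTS (SELECT * FROM table1 sub
                          WHERE sub.datetime > main.datetime)
        """
    )
]

print(later_rows)  # ids with a later datetime, per outer row
print(kept)        # outer rows that survive NOT EXISTS
```

The subquery is never evaluated as one standalone set like (2, 3, 3); it is evaluated once per outer row, and only the row whose list comes back empty is kept.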

Related Solutions

MySQL subqueries that use range based on values of main queries don’t use indices properly

This small data sample doesn't serve to illustrate that the behavior you are attempting to identify exists. Indeed, I've tested it on a larger data set and it did use the index (using MySQL 5.5.30).

The problem is that when the optimizer determines that using an index would result in an inordinately large number of matches -- compared to the total number of rows in the table -- it won't use an index, because that could actually perform worse than simply scanning the whole table, and it will exhibit exactly the behavior this example illustrates... it knows the index is a candidate, but it chooses not to use it.

But I would suggest the problem lies in the fact that you're using a subquery in a place where a subquery isn't really necessary or called for.

I rewrote this as a join, because, from what I can tell, this is what you're asking the database to do: join each row in Person to the matching row(s) in CofeeBreaks where that person took their break during that window, and average the ages of the attendees.

I also built this on SQL Fiddle. I removed TEMPORARY from the table definitions because the Fiddle doesn't seem to support them properly (probably because it uses a connection pool).

SELECT cb.id,
       cb.cofeeBreakStart,
       cb.cofeeBreakEnd,
       avg(p.age)
  FROM CofeeBreaks cb
  JOIN Person p ON p.lastCofee BETWEEN cb.cofeeBreakStart AND cb.cofeeBreakEnd
 GROUP BY cb.id;

EXPLAIN SELECT on this join shows that the index is being used, even on this small data set. If you change it to a LEFT JOIN (which shows all coffee breaks even if nobody took that particular break, while the JOIN only includes breaks where at least one person did), the index shows up as a candidate but doesn't get used. Again, that is likely because of the cost, and the behavior would likely be different with a larger data set.

The LEFT JOIN version of the query would produce identical results to your subquery regardless of the table data, while the JOIN version only produces identical results if every CofeeBreak had at least one person taking that break, which in your sample data, it does.
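To make the JOIN vs LEFT JOIN difference concrete, here is a rough sketch using Python's sqlite3 module instead of MySQL. The original subquery isn't quoted in this answer, so the correlated-subquery form below is an assumption about what it looked like. The rows, and storing times as 'HH:MM' text, are also made up for illustration; break 3 deliberately has no attendees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CofeeBreaks (id INTEGER PRIMARY KEY, "
    "cofeeBreakStart TEXT, cofeeBreakEnd TEXT)"
)
conn.execute(
    "CREATE TABLE Person (id INTEGER PRIMARY KEY, age INTEGER, lastCofee TEXT)"
)
conn.executemany(
    "INSERT INTO CofeeBreaks VALUES (?, ?, ?)",
    [(1, "09:00", "09:30"), (2, "10:00", "10:30"), (3, "15:00", "15:30")],
)
conn.executemany(
    "INSERT INTO Person (age, lastCofee) VALUES (?, ?)",
    [(20, "09:10"), (30, "09:20"), (40, "10:15")],
)

# Assumed shape of the original: one correlated subquery per coffee break.
subquery_rows = conn.execute(
    """
    SELECT cb.id,
           (SELECT avg(p.age) FROM Person p
             WHERE p.lastCofee BETWEEN cb.cofeeBreakStart AND cb.cofeeBreakEnd)
      FROM CofeeBreaks cb
     ORDER BY cb.id
    """
).fetchall()

JOIN_TEMPLATE = """
    SELECT cb.id, avg(p.age)
      FROM CofeeBreaks cb
      {join} Person p
        ON p.lastCofee BETWEEN cb.cofeeBreakStart AND cb.cofeeBreakEnd
     GROUP BY cb.id
     ORDER BY cb.id
"""
inner_rows = conn.execute(JOIN_TEMPLATE.format(join="JOIN")).fetchall()
left_rows = conn.execute(JOIN_TEMPLATE.format(join="LEFT JOIN")).fetchall()

print(subquery_rows)  # includes break 3 with a NULL (None) average
print(inner_rows)     # break 3 is missing
print(left_rows)      # matches the subquery form
```

With this data the LEFT JOIN rows match the subquery rows exactly, while the plain JOIN drops the break that nobody attended.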

But whether or not the indexes are used, a correlated subquery will not usually scale as well as a join.

http://dev.mysql.com/doc/refman/5.5/en/rewriting-subqueries.html

SQL Server Optimization – Why SQL Server Runs Subquery for Each Row

The slow plan isn't calculating the MAX for each row in the outer query.

In fact it never explicitly calculates it at all.

It gives a plan similar to

WITH CTE
     AS (SELECT TOP(1) WITH TIES *
         FROM   SubqueryTest
         WHERE year IS NOT NULL
         ORDER  BY year desc)
SELECT month,
       count(*)
FROM   CTE
GROUP  BY month
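As a sanity check that the TOP(1) WITH TIES form gives the same answer as filtering on the maximum year, here is a small sketch in Python with sqlite3. The question's query isn't quoted here, so the year = (SELECT MAX(year) ...) form is an assumption about it. SQLite has no TOP ... WITH TIES, so the ties are picked out in Python, and the rows are invented.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SubqueryTest (id INTEGER PRIMARY KEY, year INTEGER, month INTEGER)"
)
conn.executemany(
    "INSERT INTO SubqueryTest (year, month) VALUES (?, ?)",
    [(2019, 1), (2019, 2), (2020, 1), (2020, 1), (2020, 3), (None, 5)],
)

# Assumed form of the original query: filter on the maximum year.
max_form = conn.execute(
    """
    SELECT month, COUNT(*)
      FROM SubqueryTest
     WHERE year = (SELECT MAX(year) FROM SubqueryTest)
     GROUP BY month
     ORDER BY month
    """
).fetchall()

# Emulate TOP(1) WITH TIES ... ORDER BY year DESC: take the first row in
# descending year order plus every row that ties with it on year.
ordered = conn.execute(
    "SELECT year, month FROM SubqueryTest "
    "WHERE year IS NOT NULL ORDER BY year DESC"
).fetchall()
top_year = ordered[0][0]
tied_months = [month for year, month in ordered if year == top_year]
ties_form = sorted(Counter(tied_months).items())

print(max_form)
print(ties_form)
```

Both forms count only the rows in the latest year, which is why the optimizer can swap one for the other without ever computing MAX explicitly.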

Slow Plan (Estimated Row Counts)


You have a non-covering index on year asc, so it scans that backwards to get the rows for the latest year (this shows as a seek because of the implicit IS NOT NULL predicate).

Unfortunately it doesn't seem to differentiate between TOP 1 and TOP 1 WITH TIES when estimating row counts.

In this case it makes a huge difference (an estimated 2 key lookups vs an actual 4,424,803), so you get an inappropriate plan.

Slow Plan (Actual Row Counts)


You could consider adding month into the index on year either as a key or included column to make the index covering. The benefit of adding it as a secondary key column would be that it could then feed into a stream aggregate without an additional sort (though for only 12 distinct values a hash aggregate would be fine anyway).
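The covering-index idea can be tried out in miniature. This sketch uses SQLite through Python's sqlite3 module rather than SQL Server, so the plan text and optimizer behavior are SQLite's, not what SQL Server would show. The index names, the payload column and the fixed year 2020 are all made up. SQLite also has no included columns, so only the "add month as a key column" variant is shown.

```python
import sqlite3

# Simplified query: a fixed year instead of the MAX(year) subquery.
QUERY = "SELECT month, COUNT(*) FROM SubqueryTest WHERE year = 2020 GROUP BY month"


def plan_for(index_ddl):
    """Return the EXPLAIN QUERY PLAN detail strings for QUERY with one index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SubqueryTest (id INTEGER PRIMARY KEY, "
        "year INTEGER, month INTEGER, payload TEXT)"
    )
    conn.execute(index_ddl)
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    # The human-readable description is the fourth column of each plan row.
    return [row[3] for row in rows]


# Index on year only: month has to be fetched from the table for each match.
plan_year_only = plan_for("CREATE INDEX ix_year ON SubqueryTest (year)")

# Index on (year, month): every column the query needs is in the index.
plan_year_month = plan_for("CREATE INDEX ix_year_month ON SubqueryTest (year, month)")

print(plan_year_only)
print(plan_year_month)
```

In SQLite the second plan reports a covering index, meaning the query is answered from the index alone with no per-row lookups into the table. That is the same effect the suggestion above is after in SQL Server.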

A non-covering index on such a non-selective column is pretty useless for the vast majority of queries. The index is totally ignored by the "fast" plan, which ends up doing a parallel scan on the whole table and evaluating the predicate on all 27,445,400 rows (in preference to performing the huge number of lookups).

