SQL Server – JOIN/WHERE processing order

Tags: join; performance; sql-server

After reading Slow SQL query, not sure how to optimize, it got me thinking about the general performance of queries. Surely we want the result of the first table (when other tables are joined to it) to be as small as possible before joining (inner joins, for this question), so that our queries run that tiny bit faster.

Example, should this:

SELECT *
FROM   ( SELECT * FROM table1 WHERE col = @val ) t
INNER JOIN table2 ON t.col = table2.col2

Be better/faster than:

SELECT *
FROM table1
INNER JOIN table2 ON table1.col = table2.col2
WHERE table1.col = @val

My theory is as follows (this might not be the correct implementation; I'm trying to remember from a SQL Server 2008 internals book I read, published by Microsoft Press):

  1. The query processor first gets the left table (table1)
  2. Joins the second table (table2), forming a Cartesian product and then filtering out the rows that don't match the join predicate (if applicable)
  3. Then performs the WHERE, ORDER BY, GROUP BY and HAVING clauses, with the SELECT evaluated last.

So if the table in step #1 above is smaller, the SQL engine has less work to do when forming the Cartesian product, and by the time you reach the WHERE clause you have a reduced result set to filter in memory.

I could be so far off the mark it's unreal. Like I said, it's a theory.

Your thoughts?

Note: I've only just thought of this question and haven't had a chance to run any tests myself yet.

Note 2: Tagged as SQL Server as I don't know anything about the implementation of MySQL etc. Please feel free to answer/comment anyway.

Best Answer

The logical processing order of a query is documented on MSDN (written by the Microsoft SQL Server team, not third parties):

1. FROM
2. ON
3. JOIN
4. WHERE
5. GROUP BY
6. WITH CUBE or WITH ROLLUP
7. HAVING
8. SELECT
9. DISTINCT
10. ORDER BY
11. TOP

A derived table follows this order internally, then the outer query applies it again, and so on.

This is the logical order though, not the actual one. No matter how SQL Server actually executes the query, these semantics are honoured to the letter. The "actual" order is determined by the Query Optimiser (QO), and the intermediate Cartesian product you mentioned is avoided.

It's worth mentioning that SQL is declarative: you say "what", not "how" as you would in a procedural/imperative language (Java, .NET). So saying "this happens before that" is wrong in many cases (e.g. assuming short-circuit evaluation or left-to-right WHERE order).
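A sketch of the classic pitfall (table and column names invented): the first query below can still fail with a conversion error, because SQL Server does not guarantee the ISNUMERIC test runs before the CAST. A CASE expression is the usual way to force the ordering (TRY_CAST, from SQL Server 2012 onwards, is another option):

```sql
-- Unsafe: no guaranteed left-to-right evaluation, so the CAST may run
-- against rows where val is not numeric and raise a conversion error.
SELECT *
FROM   dbo.Settings
WHERE  ISNUMERIC(val) = 1
  AND  CAST(val AS int) > 5;

-- Safer: CASE evaluates its WHEN condition before its THEN expression,
-- so the CAST only runs for rows that passed the ISNUMERIC test.
SELECT *
FROM   dbo.Settings
WHERE  CASE WHEN ISNUMERIC(val) = 1
            THEN CAST(val AS int)
       END > 5;
```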

In your case above, the QO will generate the same plan no matter how it is structured because it is a simple query.

However, the QO is cost-based, and for a complex query exhaustively searching for the ideal plan could take weeks. So it settles for a "good enough" plan, which sometimes isn't.

So your first form may help the optimiser find a better plan, because the logical processing order differs between the two queries. But it may not.

I have used this trick on SQL Server 2000 to get a 60x performance improvement in reporting queries. As the QO improves from version to version, it gets better at working these things out for itself.
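A hypothetical sketch of the kind of rewrite I mean (names invented): pushing an aggregate into a derived table so the join sees one row per customer instead of every sale row. The two queries return the same results; on an older optimiser the second form could get a much better plan:

```sql
-- Join-then-aggregate: the join processes every row of dbo.Sale.
SELECT   c.Region, SUM(s.Amount) AS Total
FROM     dbo.Customer AS c
JOIN     dbo.Sale     AS s ON s.CustomerId = c.CustomerId
GROUP BY c.Region;

-- Aggregate-then-join: the derived table collapses dbo.Sale to one
-- row per customer before the join, shrinking the join input.
SELECT   c.Region, SUM(s.Total) AS Total
FROM     dbo.Customer AS c
JOIN     ( SELECT CustomerId, SUM(Amount) AS Total
           FROM   dbo.Sale
           GROUP BY CustomerId ) AS s
    ON   s.CustomerId = c.CustomerId
GROUP BY c.Region;
```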

And the book you mentioned: there is some dispute over it. See SO and the subsequent links: https://stackoverflow.com/q/3270338/27535