Performance: order of tables in joined statement

join; performance; sqlite

I have the following SQL statement, running on a SQLite database on a Windows mobile device.

SELECT 
    table1.uniqueidentifier1, table1.int1, table1.varchar1, 
    table1.decimal1, table1.decimal2 
FROM table1
INNER JOIN table2 On table1.PK = table2.FK
WHERE table2.uniqueidentifier2 IN (uniqueidentifier1,uniqueidentifier2,....)
ORDER BY table1.varchar1

As there are several hundred thousand records in each table and the device isn't really new, this takes some time.

Would performance be better if I switched the tables, like this:

SELECT 
    table1.uniqueidentifier1, table1.int1, table1.varchar1, 
    table1.decimal1, table1.decimal2 
FROM table2
INNER JOIN table1 On table1.PK = table2.FK
WHERE table2.uniqueidentifier2 IN (uniqueidentifier1,uniqueidentifier2,....)
ORDER BY table1.varchar1

Please note: in the first statement I select from table1 and join table2; in the second, the order is switched.

Why would it be faster, or why not?

Best Answer

SQLite automatically chooses the estimated optimal join order; the table order in the query has no effect.

There are two ways to optimize this query. If the filter on uniqueidentifier2 removes most table1 records from the result, then it would be fastest to look up table2 records with matching uniqueidentifier2 values first, then to look up the corresponding table1 records, and then to sort the result. This would require the following indexes:

CREATE INDEX t2_uid2_idx ON table2(uniqueidentifier2);
-- CREATE INDEX t1_pk ON table1(PK);  -- primary keys have this automatically

If most table1 records will show up in the result, then it would be more efficient to go through table1 in the proper order and look up the corresponding table2 records. This would require the following indexes:

CREATE INDEX t1_vc1_idx ON table1(varchar1);
CREATE INDEX t2_FK_uid2_idx ON table2(FK, uniqueidentifier2);

(Having uniqueidentifier2 in the second index optimizes for this particular query, but might not be worth the storage and update overhead if you have many other queries.)

To check how queries are actually implemented, execute EXPLAIN QUERY PLAN.
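As a minimal sketch of that check, the following uses Python's standard sqlite3 module to run EXPLAIN QUERY PLAN on both versions of the query from the question. The column types, the empty in-memory tables, and the literal values 'a' and 'b' in the IN list are assumptions made only for illustration; only the first pair of indexes is created.

```python
import sqlite3

# Assumed schema: the question does not give column types, so these are guesses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (
        PK INTEGER PRIMARY KEY,
        uniqueidentifier1 TEXT, int1 INTEGER, varchar1 TEXT,
        decimal1 NUMERIC, decimal2 NUMERIC
    );
    CREATE TABLE table2 (
        FK INTEGER REFERENCES table1(PK),
        uniqueidentifier2 TEXT
    );
    CREATE INDEX t2_uid2_idx ON table2(uniqueidentifier2);
""")

SELECT_LIST = (
    "table1.uniqueidentifier1, table1.int1, table1.varchar1, "
    "table1.decimal1, table1.decimal2"
)
# 'a' and 'b' are placeholder values standing in for the elided IN list.
TAIL = "WHERE table2.uniqueidentifier2 IN ('a', 'b') ORDER BY table1.varchar1"

query_t1_first = (
    f"SELECT {SELECT_LIST} FROM table1 "
    f"INNER JOIN table2 ON table1.PK = table2.FK {TAIL}"
)
query_t2_first = (
    f"SELECT {SELECT_LIST} FROM table2 "
    f"INNER JOIN table1 ON table1.PK = table2.FK {TAIL}"
)


def query_plan(sql):
    """Return the human-readable detail text of each plan step."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the detail text


plan_t1_first = query_plan(query_t1_first)
plan_t2_first = query_plan(query_t2_first)

for step in plan_t1_first:
    print(step)
print("same plan for both table orders:", plan_t1_first == plan_t2_first)
```

The exact wording of the plan steps varies between SQLite versions, so it is safer to compare the two plans with each other than to rely on a specific output string. On the real database, the plan may also change with the actual data distribution and the indexes present.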

Related Solutions

SQL Server Join Processing – Understanding Join/Where Processing Order

The logical processing order of a query is documented on MSDN (written by the Microsoft SQL Server team, not third parties):

1. FROM
2. ON
3. JOIN
4. WHERE
5. GROUP BY
6. WITH CUBE or WITH ROLLUP
7. HAVING
8. SELECT
9. DISTINCT
10. ORDER BY
11. TOP

A derived table follows this order, then the outer query goes through it again, and so on.
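To make part of the logical order listed above concrete (WHERE before GROUP BY, HAVING after it, ORDER BY near the end), here is a small sketch. It runs on SQLite through Python's sqlite3 module rather than on SQL Server, purely so that it is self-contained; the orders table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data, not from the original question.
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 5), ('ann', 20), ('ann', 30),
        ('bob', 50), ('bob', 8),
        ('cy', 15), ('cy', 25);
""")

rows = conn.execute("""
    SELECT customer, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    WHERE amount > 10          -- logically applied to rows, before grouping
    GROUP BY customer
    HAVING COUNT(*) >= 2       -- logically applied to groups, after grouping
    ORDER BY total DESC        -- logically applied after SELECT
""").fetchall()

for row in rows:
    print(row)
```

bob has two rows in the table, but only one of them survives the WHERE filter, so his group fails the HAVING condition. That result only makes sense if WHERE is logically applied before GROUP BY and HAVING, whatever physical plan the engine picks.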

This order is logical, though, not actual. No matter how SQL Server actually executes the query, these semantics are honoured to the letter. The actual execution is determined by the Query Optimiser (QO), which avoids the intermediate Cartesian product you mentioned.

It's worth mentioning that SQL is declarative: you say "what", not "how" as you would in a procedural/imperative language (Java, .NET). So saying "this happens before that" is wrong in many cases (e.g. assuming short-circuit evaluation or left-to-right evaluation of WHERE conditions).

In your case above, the QO will generate the same plan no matter how it is structured because it is a simple query.

However, the QO is cost-based, and for a complex query it could take two weeks to generate the ideal plan. So it settles for a "good enough" plan, which sometimes isn't actually good enough.

So your first case may help the optimiser find a better plan because the logical processing order is different for the 2 queries. But it may not.

I have used this trick on SQL Server 2000 to get a 60x speed improvement on reporting queries. As the QO improves from version to version, it gets better at working these things out.

As for the book you mentioned: there is some dispute over it.
See SO and the subsequent links: https://stackoverflow.com/q/3270338/27535

Postgresql – Retrieving data from inner join using LIMIT and OFFSET

Use a subquery (as displayed) or CTE for that purpose:

SELECT *
FROM  (
   SELECT qid, gid
   FROM   table1
   ORDER  BY date DESC
   LIMIT  10
   OFFSET ?
   ) q
JOIN   table2 a USING (qid, gid)

USING (qid, gid) is just a shortcut for ON q.qid = a.qid AND q.gid = a.gid with the side effect that the two columns are only included once in the result.
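The difference between USING and ON can be seen in a small sketch. It assumes SQLite (via Python's sqlite3 module) instead of PostgreSQL so that it runs without a server; the answer column, the sample rows, and the page size of 2 are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; only qid, gid and date appear in the original.
    CREATE TABLE table1 (qid INTEGER, gid INTEGER, date TEXT);
    CREATE TABLE table2 (qid INTEGER, gid INTEGER, answer TEXT);
    INSERT INTO table1 VALUES
        (1, 1, '2020-01-01'), (2, 1, '2020-01-02'), (3, 1, '2020-01-03');
    INSERT INTO table2 VALUES
        (1, 1, 'a1'), (2, 1, 'a2'), (2, 1, 'a2b'), (3, 1, 'a3');
""")

# The subquery picks one "page" of table1 rows; the join happens afterwards.
PAGE = """
    (SELECT qid, gid FROM table1 ORDER BY date DESC LIMIT 2 OFFSET ?) q
"""


def run(sql, params=()):
    """Return (column names, rows) for a query."""
    cur = conn.execute(sql, params)
    return [d[0] for d in cur.description], cur.fetchall()


using_cols, using_rows = run(
    f"SELECT * FROM {PAGE} JOIN table2 a USING (qid, gid)", (0,)
)
on_cols, on_rows = run(
    f"SELECT * FROM {PAGE} JOIN table2 a ON q.qid = a.qid AND q.gid = a.gid",
    (0,),
)

print(using_cols)  # join columns appear once
print(on_cols)     # join columns appear once per side
print(sorted(using_rows))
```

Because LIMIT and OFFSET sit inside the subquery, they count table1 rows, not joined rows: a page of two table1 rows can produce more than two result rows when table2 has several matches.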
