In the case of a join followed by a where clause, would it be better to use a subquery to limit the results, and then do the join? Example:
SELECT *
FROM Customers
NATURAL JOIN Orders
WHERE shipped=1
In this case, it seams the DBMS will join the entire Customers table with the entire Orders table, and then filter the results based on the where clause. An equivalent query using subquery would be:
SELECT *
FROM Customers
NATURAL JOIN (SELECT *
FROM Orders
WHERE shipped=1) AS O
Here, there is potentially a smaller Orders table to JOIN with. Likewise, if there was a where clause that limited both Customers and Orders:
SELECT *
FROM Customers
NATURAL JOIN Orders
WHERE country='US' AND shipped=1
(assuming country attribute belongs to Customers table)
An equivalent subquery query:
SELECT *
FROM (SELECT *
FROM Customers
WHERE country='US') AS C
NATURAL JOIN (SELECT *
FROM Orders
WHERE shipped=1) AS O
Best Answer
The answer to your question depends on the specific database and version you are using. In any case most current databases will optimize the query and end-up having the same execution plan for all cases.
I'd go for the simplest syntax that clearly states your purpose. It is probably the one that the database will be able to best optimize. And, specially, it is the one that will be easier to interpret by yourself, or anyone modifying any application later on. If you find out that some query seems to underperform, then you should check execution plans, alternatives, and find out if it can actually be improved.
You can check it with most DBs with a statement similar to
EXPLAIN SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped=1
and check what the execution plan is.This is how you would do it when using PostgreSQL:
Creation and population of (mockup) tables:
Make sure the database has proper statistics:
Check for the exeuction plan of the simplest query:
Check for the execution plan of the second version of the query:
You can see that, in this case, PostgreSQL (9.6.2) uses exactly the same execution plan. This is what will happen most often.
dbfiddle here
A slightly different version, where one index is defined (and where we decided that
shipped
orders were minority, not majority):... works with a different plan, but again, the same execution plan for both the simple and the shipped filtered first queries:
dbfiddle here
NOTE: The small differences in execution time can be due to any external factors, such as other processes using the database at the same time or not. They should only be taken into consideration if you perform the queries a statistically significant number of times, and perform the adequate statistical test for significance.
NOTE 2: I've used the
NATURAL JOIN
as you did, to not introduce more elements to the analysis. In practice, I'd rather use eitherON t1.col = t2.col
(compatible with mostly all databases) orUSING (col)
. The meaning of both remains constant.NATURAL JOIN
can change meanings if you add columns to your tables. For instance, you might decide to add columns forlast_modified_at
,last_modified_by
, to track the "age" of your data ... and your NATURAL JOINs stop working. In short: don't use them.NATURAL JOIN
does not refer to joining using the columns participating in a foreign key constraint, as you might have thought. They just refer to columns having the same name in both tables. And that's risky. I have plenty of tables with columns called created_at and last_modified_at, for instance, and it doesn't make any sense to use them to JOIN.