Should I use a subquery to limit a table before a join?

optimization, performance, query-performance, subquery

In the case of a join followed by a WHERE clause, would it be better to use a subquery to limit the results first, and then do the join? Example:

SELECT * 
FROM Customers 
    NATURAL JOIN Orders 
WHERE shipped=1

In this case, it seems the DBMS will join the entire Customers table with the entire Orders table, and then filter the results based on the WHERE clause. An equivalent query using a subquery would be:

SELECT * 
FROM Customers 
    NATURAL JOIN (SELECT * 
                     FROM Orders 
                     WHERE shipped=1) AS O

Here, there is potentially a smaller Orders table to join with. Likewise, if there were a WHERE clause that limited both Customers and Orders:

SELECT * 
FROM Customers 
    NATURAL JOIN Orders 
WHERE country='US' AND shipped=1
(assuming the country attribute belongs to the Customers table)

An equivalent query using subqueries:

SELECT * 
FROM (SELECT *
          FROM Customers 
          WHERE country='US') AS C 
    NATURAL JOIN (SELECT * 
                      FROM Orders 
                      WHERE shipped=1) AS O

Best Answer

The answer to your question depends on the specific database and version you are using. In any case, most current databases will optimize the query and end up with the same execution plan in all cases.

I'd go for the simplest syntax that clearly states your purpose. It is probably the one that the database will best be able to optimize. And, especially, it is the one that will be easiest to interpret, for yourself or for anyone modifying the application later on. If you find that some query seems to underperform, then check the execution plans and the alternatives, and find out whether it can actually be improved.

You can check this with most databases using a statement similar to EXPLAIN SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped=1 and see what the execution plan is.
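You can even script this kind of comparison. The sketch below uses Python's built-in sqlite3 module as a stand-in, with invented table contents (SQLite's plan format differs from PostgreSQL's, but the idea is the same): run both forms of the query, confirm they return the same rows, and print each plan side by side.

```python
import sqlite3

# Mock-up schema loosely mirroring the question (data invented for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES Customers(customer_id),
                         shipped INTEGER);
""")
con.executemany("INSERT INTO Customers VALUES (?, ?)",
                [(i, "Name %d" % i) for i in range(1, 101)])
con.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
                [(i, 1 + i % 100, int(i % 7 != 0)) for i in range(1, 1001)])

q_where = "SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped = 1"
q_sub = ("SELECT * FROM Customers NATURAL JOIN "
         "(SELECT * FROM Orders WHERE shipped = 1) AS O")

rows_where = sorted(con.execute(q_where).fetchall())
rows_sub = sorted(con.execute(q_sub).fetchall())
print(rows_where == rows_sub)  # same result set either way

# Print the plan steps so you can compare them by eye.
for q in (q_where, q_sub):
    print([step[3] for step in con.execute("EXPLAIN QUERY PLAN " + q)])
```

SQLite, like PostgreSQL, is free to flatten the subquery away, so the two plans typically come out identical.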

This is how you would do it when using PostgreSQL:

Creation and population of (mockup) tables:

 CREATE TABLE Customers
 (
     customer_id integer primary key,
     customer_name text not null
 ) ;

 CREATE TABLE Orders
 (
     order_id integer primary key,
     customer_id integer NOT NULL REFERENCES Customers(customer_id),
     whatever text,
     shipped boolean
 ) ;

 -- We invent 1,000 customers
 INSERT INTO 
     Customers (customer_id, customer_name)
 SELECT
     i, 'Name ' || i
 FROM
     generate_series (1, 1000) as s(i) ;

 -- and 25,000 orders
 INSERT INTO
     Orders (order_id, customer_id, whatever, shipped)
 SELECT
     i AS order_id,  
     1 + 999*random() AS customer_id,
     'a text' AS whatever,
     (random() < 0.85) AS shipped
 FROM
     generate_series(1, 25000) AS s(i) ;

Make sure the database has proper statistics:

ANALYZE Orders;
ANALYZE Customers;

Check the execution plan of the simplest query:

EXPLAIN ANALYZE
SELECT * FROM Customers NATURAL JOIN Orders WHERE shipped

 | QUERY PLAN                                                                                                            |
 | :-------------------------------------------------------------------------------------------------------------------- |
 | Hash Join  (cost=28.50..705.05 rows=21131 width=24) (actual time=0.946..29.305 rows=21131 loops=1)                    |
 |   Hash Cond: (orders.customer_id = customers.customer_id)                                                             |
 |   ->  Seq Scan on orders  (cost=0.00..386.00 rows=21131 width=16) (actual time=0.008..10.089 rows=21131 loops=1)      |
 |         Filter: shipped                                                                                               |
 |         Rows Removed by Filter: 3869                                                                                  |
 |   ->  Hash  (cost=16.00..16.00 rows=1000 width=12) (actual time=0.906..0.906 rows=1000 loops=1)                       |
 |         Buckets: 1024  Batches: 1  Memory Usage: 52kB                                                                 |
 |         ->  Seq Scan on customers  (cost=0.00..16.00 rows=1000 width=12) (actual time=0.005..0.445 rows=1000 loops=1) |
 | Planning time: 0.419 ms                                                                                               |
 | Execution time: 34.613 ms                                                                                             |
 

Check the execution plan of the second version of the query:

 EXPLAIN ANALYZE
 SELECT * FROM Customers NATURAL JOIN (SELECT * FROM Orders WHERE shipped) AS O

 | QUERY PLAN                                                                                                            |
 | :-------------------------------------------------------------------------------------------------------------------- |
 | Hash Join  (cost=28.50..705.05 rows=21131 width=24) (actual time=0.693..24.537 rows=21131 loops=1)                    |
 |   Hash Cond: (orders.customer_id = customers.customer_id)                                                             |
 |   ->  Seq Scan on orders  (cost=0.00..386.00 rows=21131 width=16) (actual time=0.007..8.534 rows=21131 loops=1)       |
 |         Filter: shipped                                                                                               |
 |         Rows Removed by Filter: 3869                                                                                  |
 |   ->  Hash  (cost=16.00..16.00 rows=1000 width=12) (actual time=0.676..0.676 rows=1000 loops=1)                       |
 |         Buckets: 1024  Batches: 1  Memory Usage: 52kB                                                                 |
 |         ->  Seq Scan on customers  (cost=0.00..16.00 rows=1000 width=12) (actual time=0.003..0.312 rows=1000 loops=1) |
 | Planning time: 0.263 ms                                                                                               |
 | Execution time: 29.083 ms                                                                                             |
 

You can see that, in this case, PostgreSQL (9.6.2) uses exactly the same execution plan for both queries. This is what will happen most often.

dbfiddle here


A slightly different version, where one index is defined (and where we decided that shipped orders were the minority, not the majority):

CREATE INDEX shipped_orders_idx ON Orders(shipped, order_id) ; 

... works with a different plan but, again, the same execution plan for both the simple and the filtered-first versions of the query:

 | QUERY PLAN                                                                                                                               |
 | :--------------------------------------------------------------------------------------------------------------------------------------- |
 | Hash Join  (cost=28.78..214.66 rows=3822 width=24) (actual time=0.750..5.985 rows=3822 loops=1)                                          |
 |   Hash Cond: (orders.customer_id = customers.customer_id)                                                                                |
 |   ->  Index Scan using shipped_orders_idx on orders  (cost=0.28..133.61 rows=3822 width=16) (actual time=0.008..1.873 rows=3822 loops=1) |
 |   ->  Hash  (cost=16.00..16.00 rows=1000 width=12) (actual time=0.730..0.730 rows=1000 loops=1)                                          |
 |         Buckets: 1024  Batches: 1  Memory Usage: 52kB                                                                                    |
 |         ->  Seq Scan on customers  (cost=0.00..16.00 rows=1000 width=12) (actual time=0.008..0.345 rows=1000 loops=1)                    |
 | Planning time: 0.209 ms                                                                                                                  |
 | Execution time: 6.863 ms                                                                                                                 |
 

dbfiddle here
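The same before/after experiment can be scripted as well. Here is a sketch, again using SQLite as a stand-in with invented data (SQLite's planner works from default estimates unless you run ANALYZE, and its plan text differs from PostgreSQL's): inspect the plan, add the index, inspect it again.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (order_id INTEGER PRIMARY KEY,"
            " customer_id INTEGER, shipped INTEGER)")
con.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
                [(i, 1 + i % 100, int(i % 7 == 0)) for i in range(1, 1001)])

def plan(sql):
    """Return the plan steps as a list of detail strings."""
    return [step[3] for step in con.execute("EXPLAIN QUERY PLAN " + sql)]

q = "SELECT * FROM Orders WHERE shipped = 1"
before = plan(q)  # full table scan: no useful index exists yet
con.execute("CREATE INDEX shipped_orders_idx ON Orders(shipped, order_id)")
after = plan(q)   # the equality filter can now use the index
print(before)
print(after)
```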


NOTE: The small differences in execution time may be due to external factors, such as other processes using the database at the same time. They should only be taken into consideration if you run the queries a statistically significant number of times and apply an adequate statistical test for significance.
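If you do want to compare timings, the minimal shape of such a measurement looks like this (pure Python stdlib; the query and row counts are invented, and a real comparison would run against your actual database and then apply a proper significance test, such as a t-test, to the two samples):

```python
import sqlite3
import statistics
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, shipped INTEGER)")
con.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(i, int(i % 7 != 0)) for i in range(1, 5001)])

def time_query(sql, runs=30):
    """Run the query `runs` times and collect wall-clock timings."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        con.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return samples

samples = time_query("SELECT * FROM Orders WHERE shipped = 1")
print("mean %.6fs  stdev %.6fs  n=%d"
      % (statistics.mean(samples), statistics.stdev(samples), len(samples)))
```

A single run of each query, as in the plans above, tells you almost nothing about relative speed; the distribution of many runs does.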

NOTE 2: I've used NATURAL JOIN as you did, so as not to introduce more elements into the analysis. In practice, I'd rather use either ON t1.col = t2.col (compatible with virtually all databases) or USING (col). The meaning of both stays constant. NATURAL JOIN can change meaning if you add columns to your tables. For instance, you might decide to add last_modified_at and last_modified_by columns to track the "age" of your data ... and your NATURAL JOINs stop working. In short: don't use them. NATURAL JOIN does not join on the columns participating in a foreign key constraint, as you might have thought; it just joins on whichever columns have the same name in both tables. And that's risky. I have plenty of tables with columns called created_at and last_modified_at, for instance, and it doesn't make any sense to use them to JOIN.
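To see that failure mode concretely, here is a sketch using SQLite's stdlib driver (tables and data invented for illustration): after an audit column with the same name is added to both tables, the NATURAL JOIN silently starts matching on it too and returns nothing, while an explicit USING (customer_id) keeps working.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders VALUES (10, 1), (11, 2), (12, 1);
""")

def count(sql):
    return con.execute(sql).fetchone()[0]

before = count("SELECT count(*) FROM Customers NATURAL JOIN Orders")

# Add an audit column with the same name to BOTH tables ...
con.execute("ALTER TABLE Customers ADD COLUMN last_modified_at TEXT")
con.execute("ALTER TABLE Orders ADD COLUMN last_modified_at TEXT")

# ... and NATURAL JOIN now also matches on last_modified_at (all NULL here,
# and NULL = NULL is not true), so the join quietly returns no rows.
broken = count("SELECT count(*) FROM Customers NATURAL JOIN Orders")

# An explicit join column is immune to the new column.
stable = count("SELECT count(*) FROM Customers JOIN Orders USING (customer_id)")

print(before, broken, stable)  # prints: 3 0 3
```

No error is raised at any point, which is exactly what makes NATURAL JOIN dangerous in an evolving schema.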