Sql-server – Filter first then join or join first then filter

optimizationsql serversql-server-2012

Consider 5 tables with more than 1 billion records each.

A query needs to join them. And I know that less than 10% of their records will be require. Say they all have a Date dimension, and only data from current month is needed.

What should be faster:

1) Use a simple SELECT: join all tables, then filter (WHERE) each table's dimension for current month.

2) create 5 temp tables, filtering each source table for current month records, here we can also take the opportunity to select only required columns, then join these temp tables.

Extra possibility:

3) Maintain secondary tables, having only current month/year worth of data. These tables are maintained by same ETL that feeds main ones.

Best Answer

Build up the query and see.
The query optimizer is often smart enough to filter early.
SQL is logical - in the where does not mean it will process last.

Clearly you want indexes on join and filter.

When you get to 5 or more joins the optimizer will often get defensive and go into a loop join.
Do I have a citation - no. It is an observation.

When it gets to 5 or more pulling the conditions into the join can help the optimizer.
Do I have a citation - no. It is an observation.

select * 
from tableA 
join tableB  
      on tableA.fkB = tableB.id 
     and tableA.date1 >= @date1start   
     and tableA.date1 <  @date1end 
join tableC  
      on tableA.fkC = tableC.id 
     and tableC.filter1 = 'do me early'  

Build it up one join at a time and see when it goes stupid.
Optimize one join at at time.

If you are going to materialize then put a pk on the #temp.
Start with where you think you are going to get the most bang.

OR condition in where / join are the most trouble and often lead to a loop join.
Those should be the first to materialize.

You can force hash join but that is a slippery slope that can go bad.

It seems kind of odd all tables have a date dimension, you would typically have some lookup type tables with more static type data.

If you don't need output from the table then a where exists may preform better.