Postgresql – Are WHERE clauses applied in the order they are written

performancepostgresqlpostgresql-performance

I'm trying to optimize a query which looks into a big table (37 millions rows) and have a question about what order the operations are executed in a query.

select 1 
from workdays day
where day.date_day >= '2014-10-01' 
    and day.date_day <= '2015-09-30' 
    and day.offer_id in (
        select offer.offer_day 
        from offer  
        inner join province on offer.id_province = province.id_province  
        inner join center cr on cr.id_cr = province.id_cr 
        where upper(offer.code_status) <> 'A' 
            and province.id_region in ('10' ,'15' ,'21' ,'26' ,'31' , ...,'557') 
            and province.id_cr in ('9' ,'14' ,'20' ,'25' ,'30' ,'35' ,'37')
    )

Are the WHERE clauses for the date range executed before the subquery?
Is it a good way to put the most restrictive clauses first to avoid big loops for other clauses, in order to do a faster execution?

Now queries takes so much time to execute.

Best Answer

To elaborate on @alci's answer:

PostgreSQL doesn't care what order you write things in

PostgreSQL doesn't care at all about the order of entries in a WHERE clause, and chooses indexes and execution order based on cost and selectivity estimation alone.
The order in which joins are written is also ignored up to the configured join_collapse_limit; if there are more joins than that, it'll execute them in the order they're written.
Subqueries can be executed before or after the query that contains them, depending on what's fastest, so long as the subquery is executed before the outer query actually needs the information. Often in reality the subquery gets executed kind-of in the middle, or interleaved with the outer query.
There's no guarantee PostgreSQL will actually execute parts of the query at all. They can be completely optimized away. This is important if you call functions with side-effects.

PostgreSQL will transform your query

PostgreSQL will heavily transform queries while retaining the exact same effects, in order to make them run faster while not changing the results.

Terms outside a subquery can get pushed down into the subquery so they execute as part of the subquery not where you wrote them in the outer query
Terms in the subquery can be pulled up to the outer query so their execution is done as part of the outer query, not where you wrote them in the subquery
The subquery can, and often is, flattened into a join on the outer table. The same is true of things like EXISTS and NOT EXISTS queries.
Views get flattened into the query that uses the view
SQL functions often get inlined into the calling query
... and there are numerous other transformations made to queries, such as constant expression pre-evaluation, de-correlation of some subqueries, and all sorts of other planner/optimizer tricks.

In general PostgreSQL can massively transform and rewrite your query, to the point where each of these queries:

select my_table.*
from my_table
left join other_table on (my_table.id = other_table.my_table_id)
where other_table.id is null;

select *
from my_table
where not exists (
  select 1
  from other_table
  where other_table.my_table_id = my_table.id
);

select *
from my_table
where my_table.id not in (
  select my_table_id
  from other_table
  where my_table_id is not null
);

will usually all produce exactly the same query plan. (Assuming I didn't make any dumb mistakes in the above anyway).

It's not uncommon to try to optimize a query only to find that the query planner has already figured out the tricks you're trying and applied them automatically, so the hand-optimized version is no better than the original.

Limitations

The planner/optimizer is far from omnicient, and is limited by the requirement to be absolutely certain it can't change the effects of the query, the available data to make decisions with, the rules that've been implemented, and the CPU time it can afford to spend pondering the optimizations. For example:

The planner relies on statistics kept by ANALYZE (usually via autovacuum). If these are outdated, the plan choice can be bad.
The statistics are only a sample, so they can be misleading due to sampling effects, especially if too small a sample is taken. Bad plan choices can result.
The statistics don't keep track of some kinds of data about the table, like correlations between columns. This can lead the planner to make bad decisions when it assumes things are independent when they aren't.
The planner relies on cost parameters like random_page_cost to tell it the relative speed of various operations on the particular system it's installed on. These are only guides. If they're badly wrong they can lead to poor plan choices.
Any subquery with a LIMIT or OFFSET cannot be flattened or be subject to pullup / pushdown. This doesn't mean it'll execute before all parts of the outer query, though, or even that it'll execute at all.
CTE terms (the clauses in a WITH query) are always executed in their entirety, if they're executed at all. They can't be flattened, and terms can't be pushed up or pulled down across the CTE term barrier. CTE terms are always executed before the final query. This is non-SQL-standard behaviour, but it's documented as how PostgreSQL does things.
PostgreSQL has a limited ability to optimize across queries on foreign tables, security_barrier views, and certain other special kinds of relation
PostgreSQL won't inline a function written in anything except plain SQL, nor do pullup/pushdown between it and the outer query.
The planner/optimizer is really dumb about selecting expression indexes, and about trivial data type differences between index and expression.

Tons more, too.

Your query

In the case of your query:

select 1 
from workdays day
where day.date_day >= '2014-10-01' 
    and day.date_day <= '2015-09-30' 
    and day.offer_id in (
        select offer.offer_day 
        from offer  
        inner join province on offer.id_province = province.id_province  
        inner join center cr on cr.id_cr = province.id_cr 
        where upper(offer.code_status) <> 'A' 
            and province.id_region in ('10' ,'15' ,'21' ,'26' ,'31' , ...,'557') 
            and province.id_cr in ('9' ,'14' ,'20' ,'25' ,'30' ,'35' ,'37')
    )

nothing stops it from being flattened into a simpler query with an extra set of joins, and it very likely will be.

It'll probably turn out something like (untested, obviously):

select 1 
from workdays day
inner join offer on day.offer_id = offer.offer_day
inner join province on offer.id_province = province.id_province  
inner join center cr on cr.id_cr = province.id_cr 
where upper(offer.code_status) <> 'A' 
   and province.id_region in ('10' ,'15' ,'21' ,'26' ,'31' , ...,'557') 
   and province.id_cr in ('9' ,'14' ,'20' ,'25' ,'30' ,'35' ,'37')
   and day.date_day >= '2014-10-01' 
   and day.date_day <= '2015-09-30';

PostgreSQL will then optimise the join order and join methods based on its selectivity and row count estimates and the available indexes. If these reasonably reflect reality then it'll do the joins and run the where clause entries in whatever order is best - often mixing them together, so it does a bit of this, then a bit of that, then comes back to the first part, etc.

How to see what the optimizer did

You can't see the SQL that PostgreSQL optimizes your query into, because it converts the SQL to an internal query tree representation then modifies that. You can dump the query plan and compare it to other queries.

There's no way to "deparse" that query plan or the internal plan tree back to SQL.

http://explain.depesz.com/ has a decent query plan helper. If you're totally new to query plans etc (in which case I'm amazed you made it this far through this post) then PgAdmin has a graphical query plan viewer that provides much less information but is simpler.

Related reading:

Pushdown/pullup and flattening capabilities continue to improve in each release. PostgreSQL is usually right about pull-up/push-down/flattening decisions, but not always, so occasionally you have to (ab)use a CTE or the OFFSET 0 hack. If you find such a case, report a query planner bug.

If you're really, really keen you can also use the debug_print_plans option to see the raw query plan, but I promise you don't want to read that. Really.

Related Solutions

SQL Server Join Processing – Understanding Join/Where Processing Order

The logical processing of a query is on MSDN (written by Microsoft SQL Server team, not 3rd parties)

1. FROM
2. ON
3. JOIN
4. WHERE
5. GROUP BY
6. WITH CUBE or WITH ROLLUP
7. HAVING
8. SELECT
9. DISTINCT
10. ORDER BY
11. TOP

A derived table follows this, then the outer query does it again etc etc

This is logical though: not actual. No matter how SQL Server actually does it, these semantics are honoured to the letter. The "actual" is determined by the Query Optimiser (QO) and you avoid the intermediate Cartesion product you mentioned.

It's worth mentioning that SQL is declarative: you say "what" not "how" like you would for a procedural/imperative programming (Java, .net). So saying "this happens before that" is wrong in many cases (eg assumption of short circuits or L-to-R WHERE order)

In your case above, the QO will generate the same plan no matter how it is structured because it is a simple query.

However, the QO is cost based and for a complex query it may take 2 weeks to generate the ideal plan. So it does "good enough" which actually isn't.

So your first case may help the optimiser find a better plan because the logical processing order is different for the 2 queries. But it may not.

I have used this trick on SQL Server 2000 to get 60x speed performance improvement on reporting queries. As the QO improves version to version it gets better at working these things out.

And the book you mentioned: there is some dispute over it
See SO and the subsequent links: https://stackoverflow.com/q/3270338/27535

SQL Server – How to Optimize a Slow Query with Nested Loops Inner Join

The problem appears to be in this part of the code:

JOIN category_link l on l.sku_id IN (SELECT value FROM #Ids) AND
(
    l.category_id = c4.category_id OR
    l.category_id = c5.category_id
)

or in join conditions is always suspicious. One suggestion is to split this into two joins:

JOIN category_link l1 on l1.sku_id in (SELECT value FROM #Ids) and l1.category_id = cr.category_id
left outer join
category_link l1 on l2.sku_id in (SELECT value FROM #Ids) and l2.category_id = cr.category_id

You then have to modify the rest of the query to handle this . . . coalesce(l1.sku_id, l2.sku_id) for instance in the select clause.