I would have guessed that when a query includes TOP n the database
engine would run the query ignoring the TOP clause, and then at
the end just shrink that result set down to the n number of rows that
was requested. The graphical execution plan seems to indicate this is
the case -- TOP is the "last" step. But it appears there is more going
on.
The way the above is phrased makes me think you may have an incorrect mental picture of how a query executes. An operator in a query plan is not a step (where the full result set of a previous step is evaluated by the next one).
SQL Server uses a pipelined execution model, where each operator exposes methods like Init(), GetRow(), and Close(). As the GetRow() name suggests, an operator produces one row at a time on demand (as required by its parent operator). This is documented in the Books Online Logical and Physical Operators reference, with more detail in my blog post Why Query Plans Run Backwards. This row-at-a-time model is essential in forming a sound intuition for query execution.
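To make the pull-based model concrete, here is a minimal Python sketch of an iterator-style pipeline. The class and method names are illustrative only, not SQL Server internals; the point is that the TOP operator at the root drives execution, and rows it never requests are never produced by the operators below it:

```python
# Illustrative sketch of the pipelined (iterator) execution model.
# Each operator pulls rows one at a time from its child on demand.

class Scan:
    """Leaf operator: produces rows from an in-memory source on demand."""
    def __init__(self, rows):
        self.it = iter(rows)

    def get_row(self):
        return next(self.it, None)  # None signals end of stream

class Filter:
    """Pulls from its child until a row satisfies the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def get_row(self):
        row = self.child.get_row()
        while row is not None and not self.predicate(row):
            row = self.child.get_row()
        return row

class Top:
    """Stops pulling from its child after n rows; the rows that were
    never requested are never produced by the operators below."""
    def __init__(self, child, n):
        self.child, self.remaining = child, n

    def get_row(self):
        if self.remaining <= 0:
            return None
        self.remaining -= 1
        return self.child.get_row()

# TOP (2) sits at the root and drives execution by pulling rows:
plan = Top(Filter(Scan(range(1000)), lambda r: r % 2 == 0), n=2)
rows = []
row = plan.get_row()
while row is not None:
    rows.append(row)
    row = plan.get_row()
print(rows)  # only the first two matching rows are ever produced
```

Notice that the scan over 1000 rows stops after reading just three of them, because TOP never asks for more.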
My question is, how (and why) does a TOP n clause impact the execution plan of a query?
Some logical operations like TOP, semi joins, and the FAST n query hint affect the way the query optimizer costs execution plan alternatives. The basic idea is that one possible plan shape might return the first n rows more quickly than a different plan that was optimized to return all rows.
For example, indexed nested loops join is often the fastest way to return a small number of rows, though hash or merge join with scans might be more efficient on larger sets. The way the query optimizer reasons about these choices is by setting a Row Goal at a particular point in the logical tree of operations.
A row goal modifies the way query plan alternatives are costed. The essence of it is that the optimizer starts by costing each operator as if the full result set were required, sets a row goal at the appropriate point, and then works back down the plan tree estimating the number of rows it expects to need to examine to meet the row goal.
For example, a logical TOP(10) sets a row goal of 10 at a particular point in the logical query tree. The costs of operators leading up to the row goal are modified to estimate how many rows they need to produce to meet the row goal. This calculation can become complex, so it is easier to understand all this with a fully worked example and annotated execution plans. Row goals can affect more than the choice of join type or whether seeks and lookups are preferred to scans. More details on that here.
As always, an execution plan selected on the basis of a row goal is subject to the optimizer's reasoning abilities and the quality of information provided to it. Not every plan with a row goal will produce the required number of rows faster in practice, even though the costing model predicts it will.
Where a row goal plan proves not to be faster, there are usually ways to modify the query or provide better information to the optimizer such that the naturally selected plan is best. Which option is appropriate in your case depends on the details of course. The row goal feature is generally very effective (though there is a bug to watch out for when used in parallel execution plans).
Your particular query and plan may not be suitable for detailed analysis here (by all means provide an actual execution plan if you wish) but hopefully the ideas outlined here will allow you to make forward progress.
I believe the settings you mentioned already ensure that GEQO will be used, except that the values for geqo_pool_size and geqo_generations are so low that the results won't be good.
Apart from that, to get different plans, you should play with geqo_seed.
From the doc's section Generating Possible Plans with GEQO:
As long as geqo_seed and the other GEQO parameters are kept fixed, the
same plan will be generated for a given query (and other planner
inputs such as statistics). To experiment with different search paths,
try changing geqo_seed.
About geqo_seed:
Controls the initial value of the random number generator used by GEQO
to select random paths through the join order search space. The value
can range from zero (the default) to one. Varying the value changes
the set of join paths explored, and may result in a better or worse
best path being found.
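As a concrete illustration, a hypothetical psql session for the experiment the docs describe might look like this (the SELECT is a placeholder for your own many-join query, and geqo_threshold is lowered here only to force GEQO for demonstration purposes):

```sql
-- Vary geqo_seed (and nothing else) to make GEQO explore different
-- join orders for the same query.
SET geqo = on;
SET geqo_threshold = 2;   -- demo only: force GEQO even for few joins
SET geqo_seed = 0.0;      -- the default
EXPLAIN SELECT ... ;      -- plan A

SET geqo_seed = 0.5;
EXPLAIN SELECT ... ;      -- possibly a different plan for the same query
```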
No, there is no way to save a query plan to disk, force a query plan, etc.
Fix your optimizer parameters and make sure your stats are accurate. You didn't bother to mention your version, show explain analyze output, etc., so I can't help you with that in detail. Start here: http://wiki.postgresql.org/wiki/Slow_Query_Questions
Check your rowcount estimates vs actual results in explain analyze output. http://explain.depesz.com/ makes this easier.
Check that cost estimates reflect reality. Consider adjusting random_page_cost/seq_page_cost if your random I/O is faster or slower than PostgreSQL's estimates. Also ensure effective_cache_size reasonably reflects the amount of RAM available for caching.
If you continue to have issues after making appropriate changes per the manual and tuning guides, please report a detailed description of the problem and a self-contained test case to the pgsql-performance mailing list.
My personal opinion is that PostgreSQL's policy of having no hints (except actually there are hints, they're just uglier hacks than actual hints would be) is user-hostile and pointless. Most users will use whatever hacks they can to avoid actually understanding cost-based optimizers, so we're just forcing them to use uglier hacks. Users who want to understand and do it right will still find times the optimizer produces inappropriate results, and while it's good to fix those, it's also good for those users to have a workaround in the meantime. And users who are cost-based optimizer purists don't have to use any hinting features that are there, including the not-hints that already exist.
This drives me especially nuts with common table expressions, where PostgreSQL's quirky interpretation of them as optimization fences is going to cause us pain and suffering down the road when we want to start optimizing into them, but discover everyone's been using them as a sort of query hint workaround for the query hints we don't have.
Examples of not-hints include:
enable_ parameters
OFFSET 0
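A minimal sketch of how these two are used in practice (table and column names are hypothetical):

```sql
-- 1. enable_* parameters: steer the planner away from a plan type for
--    the current session, e.g. to forbid sequential scans:
SET enable_seqscan = off;
EXPLAIN SELECT * FROM some_table WHERE some_col = 42;
RESET enable_seqscan;

-- 2. OFFSET 0: a subquery with OFFSET 0 is not flattened into the outer
--    query, so it acts as an optimization fence:
SELECT *
FROM (SELECT * FROM some_table WHERE some_col = 42 OFFSET 0) AS sub
WHERE sub.other_col > 0;
```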