What is more important in a query execution plan, cardinality or selectivity?

execution-plan optimization performance query-performance statistics

The paper "How good are query optimizers, really?" says that cardinality estimation is the factor that most influences finding a better query execution plan in an RDBMS. A summary of the paper can be found here. However, other posts that explain the difference between cardinality and selectivity point out that selectivity is more important.

Example:

select max(price) from tickets where country = 'CANADA';

Let us say we have a table tickets with 180 rows. Based on the above
example, there are 10 rows in the table for country = 'CANADA'. The
query returns only one row, the max(price).
selectivity = ?
cardinality = ?

selectivity = number of rows accessed / total number of rows = 10/180 ≈ 0.056 (about 5.6% of the rows were accessed)
cardinality = number of rows accessed = 10
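
For reference, these figures could be checked directly against the data with plain SQL (a sketch only, assuming the tickets table above):

-- counts behind the example numbers
select
    sum(case when country = 'CANADA' then 1 else 0 end) as rows_accessed,   -- 10
    count(*)                                            as total_rows,      -- 180
    sum(case when country = 'CANADA' then 1 else 0 end) * 1.0 / count(*)
                                                        as selectivity      -- ~ 0.056
from tickets;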

My opinion is that selectivity can only be computed once we have the cardinality, so cardinality remains the more important input for finding the best query execution plan. What do you think? Could you please answer with a detailed explanation of your understanding? Thanks.

Best Answer

Since one is a function of the other, they are equally important.

Implementation details differ between DBMSes, but in general the process that collects table and index statistics calculates cardinality for key columns and stores these values wherever it keeps statistics information, usually in some catalog tables.
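
As a concrete illustration (PostgreSQL here, purely as an example; other engines keep equivalent information in their own catalogs), the ANALYZE command collects these statistics and catalog views then expose them:

-- collect table and column statistics for the assumed tickets table
analyze tickets;

-- estimated table cardinality as kept in the catalog
select reltuples from pg_class where relname = 'tickets';

-- per-column statistics: distinct-value count, most common values and their frequencies
select attname, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'tickets';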

For example, in addition to the full table cardinality (180 in your example for the tickets table), it may gather and store cardinalities (i.e. the number of distinct values) for some columns. This allows the optimizer to estimate selectivity for those columns: the column cardinality divided by the table cardinality gives the overall column selectivity, and under a uniformity assumption an equality predicate on such a column is expected to match table cardinality / column cardinality rows, i.e. a selectivity of 1 / column cardinality.
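
A back-of-the-envelope version of that estimate for the tickets table might look like this (a sketch only; the optimizer reads these numbers from its stored statistics rather than scanning the table):

select
    count(*)                      as table_cardinality,        -- 180 in the example
    count(distinct country)       as column_cardinality,       -- number of distinct countries
    1.0 / count(distinct country) as est_equality_selectivity, -- uniformity assumption
    count(*) * 1.0 / count(distinct country)
                                  as est_rows_per_value        -- estimated rows for country = <value>
from tickets;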

Many DBMSes also maintain histograms for key columns, that is, cardinalities of specific values in these columns. In your example the engine might store, among other statistics, the fact that there are 10 rows with 'CANADA' in the country column, 18 rows with 'USA', 8 rows with 'Brazil', etc.
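
What such a histogram records is essentially the result of a per-value count over the column, along these lines (again a sketch over the assumed tickets table):

-- per-value cardinalities that a histogram of the country column would capture
select country, count(*) as value_cardinality
from tickets
group by country
order by value_cardinality desc;
-- e.g. USA 18, CANADA 10, Brazil 8, ...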

Similarly, statistics for an index would contain not only the total number of (potentially non-unique) key values, but also the number of distinct key values for the entire key, and often for each "sub-key"* as well, allowing the optimizer to compute selectivity when needed.


* As an example, for a non-unique index containing columns A, B, and C the engine would store the full index cardinality (which is equal to the table cardinality), and the distinct counts of combinations (A, B, C), (A, B), and (A).
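
In PostgreSQL-flavoured SQL (chosen only for illustration; t, a, b and c are the hypothetical table and columns from the footnote), those per-prefix cardinalities correspond to counts like these:

-- distinct counts an index on (a, b, c) would typically record
select
    count(*)                  as index_cardinality, -- equals the table cardinality
    count(distinct (a, b, c)) as distinct_a_b_c,
    count(distinct (a, b))    as distinct_a_b,
    count(distinct a)         as distinct_a
from t;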