What is more important in a query execution plan, cardinality or selectivity?

execution-plan optimization performance query-performance statistics

The paper "How good are query optimizers, really?" says that cardinality estimation is the factor that most influences finding a better query execution plan in an RDBMS. A summary of the paper can be found here. However, other posts that explain the difference between cardinality and selectivity point out that selectivity is more important.

Example:

select max(price) from tickets where country = 'CANADA';

Let us say we have a table tickets with 180 rows. Based on the above
example, there are 10 rows in the table for country = 'CANADA'. The
query returns only one row, the max(price).
selectivity = ?
cardinality = ?

selectivity = number of rows accessed / total number of rows = 10/180 ≈ 0.056 (about 5.6% of the rows were accessed)
cardinality = number of rows accessed = 10
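
For reference, these figures could be checked directly against the data with plain SQL (a sketch only, assuming the tickets table above):

-- counts behind the example numbers
select
    sum(case when country = 'CANADA' then 1 else 0 end) as rows_accessed,   -- 10
    count(*)                                            as total_rows,      -- 180
    sum(case when country = 'CANADA' then 1 else 0 end) * 1.0 / count(*)
                                                        as selectivity      -- ~ 0.056
from tickets;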

My opinion is that selectivity can only be computed once we have the cardinality, so cardinality remains the more important input for finding the best query execution plan. What do you think? Could you please answer with a detailed explanation of your understanding? Thanks.

Best Answer

Since one is a function of the other, they are equally important.

Implementation details differ between DBMSes, but in general the process that collects table and index statistics calculates cardinality for key columns and stores these values wherever it keeps statistics information, usually in some catalog tables.
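
As a concrete illustration (PostgreSQL here, purely as an example; other engines keep equivalent information in their own catalogs), the ANALYZE command collects these statistics and catalog views then expose them:

-- collect table and column statistics for the assumed tickets table
analyze tickets;

-- estimated table cardinality as kept in the catalog
select reltuples from pg_class where relname = 'tickets';

-- per-column statistics: distinct-value count, most common values and their frequencies
select attname, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'tickets';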

For example, in addition to the full table cardinality (180 in your example for the tickets table), it may gather and store cardinalities (i.e. the number of distinct values) for some columns. This allows the optimizer to estimate selectivity for those columns: the column cardinality divided by the table cardinality gives the overall column selectivity, and under a uniformity assumption an equality predicate on such a column is expected to match table cardinality / column cardinality rows, i.e. a selectivity of 1 / column cardinality.
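
A back-of-the-envelope version of that estimate for the tickets table might look like this (a sketch only; the optimizer reads these numbers from its stored statistics rather than scanning the table):

select
    count(*)                      as table_cardinality,        -- 180 in the example
    count(distinct country)       as column_cardinality,       -- number of distinct countries
    1.0 / count(distinct country) as est_equality_selectivity, -- uniformity assumption
    count(*) * 1.0 / count(distinct country)
                                  as est_rows_per_value        -- estimated rows for country = <value>
from tickets;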

Many DBMSes also maintain histograms for key columns, that is, cardinalities of specific values in these columns. In your example the engine might store, among other statistics, the fact that there are 10 rows with 'CANADA' in the country column, 18 rows with 'USA', 8 rows with 'Brazil', etc.
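
What such a histogram records is essentially the result of a per-value count over the column, along these lines (again a sketch over the assumed tickets table):

-- per-value cardinalities that a histogram of the country column would capture
select country, count(*) as value_cardinality
from tickets
group by country
order by value_cardinality desc;
-- e.g. USA 18, CANADA 10, Brazil 8, ...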

Similarly, statistics for an index would contain not only the total number of (potentially non-unique) key values, but also the number of distinct key values for the entire key, and often for each "sub-key"* as well, allowing the optimizer to compute selectivity when needed.


* As an example, for a non-unique index containing columns A, B, and C the engine would store the full index cardinality (which is equal to the table cardinality), and the distinct counts of combinations (A, B, C), (A, B), and (A).
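
In PostgreSQL-flavoured SQL (chosen only for illustration; t, a, b and c are the hypothetical table and columns from the footnote), those per-prefix cardinalities correspond to counts like these:

-- distinct counts an index on (a, b, c) would typically record
select
    count(*)                  as index_cardinality, -- equals the table cardinality
    count(distinct (a, b, c)) as distinct_a_b_c,
    count(distinct (a, b))    as distinct_a_b,
    count(distinct a)         as distinct_a
from t;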