Sql-server – Example of query execution plan where the cardinality estimation makes a difference

cardinality-estimatesexecution-planoptimizationsql server

I am trying to create an example of a query execution plan which shows that if I have the cardinality estimation for some column of a table I can decide which query execution plan has the least cost.

So I have these two posts (1 and 2) which I can understand estimated cardinality as the number of distinct values on a column. And then I calculate the selectivity.

SELECT max(price) FROM tickets WHERE country = "CANADA";

If my table has 180 items and only 10 are CANADA. And there are only 4 different countries (CANADA, BRAZIL, USA, GERMANY). So….

the cardinality of column country é 4, because it is the number of distinct items.

The selectivity of the column country = CANADA is the number of items accessed divided by the items on the table. 10/180 = 0.0555.

But which are the different query execution plans that the optimizer could choose based on the cardinality?

If this is not a good example, could someone point one query that the cardinality would be a treasure to the optimizer in order to decide to chose one plan versus another?

Thanks,
Felipe

Best Answer

At 180 rows, you're not likely to see any difference whatsoever.

In order to see different plans get chosen at different predicate distributions, you'd need quite bit more data, and likely competing data sources (indexes) to aid them.

Let's say you had 1.8mm rows in the table. If you've got

A clustered index on a column unrelated to price or country (Like Id or something)
A nonclustered index on price and country
A single column nonclustered index on country

You may start to see a difference in the method of aggregation, serial vs. parallel plans chosen, and index choice, depending on which country you search for and its selectivity.

For an example of a recent Q&A:

Large variances between two similar queries when doing WHERE EXISTS on related table

For a more complex reasons:

Why is my query suddenly slower than it was yesterday?

Related Solutions

Different execution plan for the same query if I change a value in the predicate

The problem was histograms, I ran statistics and disabled histogram creation and the execution plan used nested loops:

BEGIN
  DBMS_STATS.GATHER_table_STATS (OWNNAME => 'MIDAS', TABNAME => 'MINCISOC', 
  METHOD_OPT => 'FOR ALL COLUMNS SIZE 1');
END;

If I run it with FOR ALL COLUMNS SIZE AUTO again the same problem because it uses hash join. Thanks to Phil for the suggestion.

Mysql – optimize the query with big and low cardinality

To check exactly how selective your filter is for your particular query, you could execute the following queries:

SELECT 
   (SELECT count(*) FROM indi WHERE sex = '2')
   /
   (SELECT count(*) FROM indi)
   * 100
   as selectivity;

SELECT 
   (SELECT count(*) FROM c_loc WHERE location_id IN (3,4,5,6))
   /
   (SELECT count(*) FROM c_loc)
   * 100
   as selectivity;

The first query will probably return something around 50% in most applications, the second will give you around 1%, if the locations are more or less equally distributed, but it can vary.

Having a low cardinality is not per se a bad thing, the problem is when you need to search only on that column in a efficient way with an index. Having low cardinality means that a full table scan may actually be faster. The limit depends on many factors, and the MySQL optimizer is a bit more complex than a simple percentage calculation, but here you have a real example of when an index is no longer useful:

Full table scan vs. index

When the selectivity is higher than around 20%, the index tends to be no longer useful for filtering.

There is no "real" solution modifying the query. For those cases, you may need a different method for indexing, different than the standard BTREE algorithm. In most cases, you can provide a solution on MySQL using partitioning (even doing it manually, storing on a different table males and females, for example).

For your particular case, the index is still being used, which means it may be still good enough, and you may not need to do any additional work, but I though it was important to understand the consequences of low selectivity, and playing around with indexes to see if you were getting an optimal query plan. However, if the final number of rows is relatively large, there is no good solution: if you select a lot of rows, it will take a lot of time to do it. If your final result set is relatively small, try to force the join order with the most selective clauses at the beginning, by using STRAIGHT_JOIN, if the logic allows it. This is dangerous without proper monitoring, because cardinality may change with time, and so does the selectivity for the same query just by using slightly different parameters.

Best Answer

Related Solutions

Different execution plan for the same query if I change a value in the predicate

Mysql – optimize the query with big and low cardinality

Related Question