MySQL – Cardinality discrepancy

Tags: mysql, performance, statistics

I have two almost identical databases, and I am trying to figure out why queries perform so much slower in one of them. While investigating with the information_schema.statistics table, I noticed that the cardinality for one of the indexed columns was about four times as large as for the same column in the other database.
For the first database the cardinality is:

+------------------------+-------------+
| column_name            | cardinality |
+------------------------+-------------+
| date_time              |       46202 |
+------------------------+-------------+

and for the second database:

+------------------------+-------------+
| column_name            | cardinality |
+------------------------+-------------+
| date_time              |      185052 |
+------------------------+-------------+

Since the cardinality is defined as

An estimate of the number of unique values in the index.

I wondered how many unique date_time values this column actually holds. It turns out that the count is the same in both databases (select distinct date_time from table returns 54187 rows).

So if the unique values are exactly the same, why is the cardinality of the date_time column in the second database almost four times that in the first?

Best Answer

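The cardinalities are estimates. With InnoDB, for instance, they are computed by sampling a limited number of index pages, so two copies of the same data can report quite different values.
They have limited use, so don't worry if the values are as far off as what you see.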
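
To make the "estimates" point concrete, here is a rough Python sketch of a sampling-based estimator. It is a deliberately simplified model and not MySQL's actual algorithm: the page size, the sample of 8 pages, the skewed test data, and the `estimate_cardinality` helper are all assumptions made for illustration.

```python
import random

def estimate_cardinality(index_values, page_size, sample_pages, rng):
    """Estimate the distinct count by sampling a few 'pages' of a sorted index.

    Simplified illustration only: count the distinct values on each sampled
    page, average them, and scale by the total number of pages.
    """
    ordered = sorted(index_values)
    pages = [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]
    sampled = rng.sample(pages, min(sample_pages, len(pages)))
    avg_distinct = sum(len(set(page)) for page in sampled) / len(sampled)
    return round(avg_distinct * len(pages))

# Hypothetical skewed column: 500 values that occur once, 500 that occur 20 times.
values = list(range(500)) + [v for v in range(500, 1000) for _ in range(20)]
true_distinct = len(set(values))

# Same data every time; each seed stands in for a separate statistics run.
estimates = [
    estimate_cardinality(values, page_size=100, sample_pages=8, rng=random.Random(seed))
    for seed in range(50)
]
print("true distinct:", true_distinct)
print("lowest / highest estimate:", min(estimates), max(estimates))
```

In this toy model the estimate depends entirely on which pages happen to be sampled, which is one plausible way for identical data to end up with very different reported cardinalities.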

If you are worried that some queries are being optimized poorly, let's see the queries, SHOW CREATE TABLE, and EXPLAIN SELECT ... There are many things that can be looked at; most do not involve cardinality.

A typical solution involves a "composite" index. For example:

WHERE a=1 AND b=2

If you have INDEX(a), INDEX(b), the Optimizer will typically use only one of the indexes and may use cardinality to decide between them.

If you have INDEX(a,b), the decision is obvious, cardinality is irrelevant, and the query will run faster.
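
As a rough illustration of why the composite index wins, the Python sketch below counts how many rows have to be examined for WHERE a=1 AND b=2 under each indexing choice. Plain dictionaries stand in for the indexes, and the table contents are hypothetical; this is a toy model of the idea, not of MySQL's B-tree implementation.

```python
from collections import defaultdict

# Hypothetical table: 10,000 rows, a has 10 distinct values, b has 100.
rows = [{"id": i, "a": i % 10, "b": (i // 10) % 100} for i in range(10_000)]

index_a = defaultdict(list)    # stands in for INDEX(a)
index_b = defaultdict(list)    # stands in for INDEX(b)
index_ab = defaultdict(list)   # stands in for INDEX(a,b)
for row in rows:
    index_a[row["a"]].append(row)
    index_b[row["b"]].append(row)
    index_ab[(row["a"], row["b"])].append(row)

# INDEX(a): fetch every row with a=1, then filter on b.
examined_a = len(index_a[1])
matches_a = [r for r in index_a[1] if r["b"] == 2]

# INDEX(b): fetch every row with b=2, then filter on a.
examined_b = len(index_b[2])
matches_b = [r for r in index_b[2] if r["a"] == 1]

# INDEX(a,b): go straight to the matching rows.
examined_ab = len(index_ab[(1, 2)])
matches_ab = index_ab[(1, 2)]

print(examined_a, examined_b, examined_ab)
```

With either single-column index the choice between them matters (1000 rows examined versus 100 here), which is where cardinality estimates come in. With the composite index only the 10 matching rows are touched, so there is nothing left to choose.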

Related Solutions

MySQL – Low Cardinality of Index performing better than High Cardinality

The between condition is shorthand for:

 ... range_from <= $value 
 and $value <= range_to

In this context range_from and range_to have an effective cardinality of 2: in range or out of range.

Given your indexes, range_from needs to be searched for all values up to and including $value. If $value is close to the end of the possible values, the number of records that need to be checked is quite high. If you are looking for recent records, this will be close to all the records.

Using the index with the depot column first, far fewer records match the range_from value. If you have 100 depots, you have roughly 1% of the records to check compared to not using it, assuming the records are spread evenly across depots. Depots with few records will be faster than those with more records. Your results on table A will vary depending on the value of $other.
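
The effect of putting depot first can be sketched in Python with sorted lists and `bisect` standing in for the two indexes. The data set (100 depots with 100 non-overlapping ranges each) and the lookup values are hypothetical, chosen only to mirror the 1% figure above.

```python
from bisect import bisect_left, bisect_right

# Hypothetical records: (depot, range_from, range_to), ranges are 10 wide.
records = [(depot, start, start + 9)
           for depot in range(100)
           for start in range(0, 1000, 10)]

# Sorted tuples stand in for the two candidate indexes.
idx_from = sorted((start, depot, end) for depot, start, end in records)
idx_depot_from = sorted(records)

depot, value = 7, 985
INF = float("inf")

# Index starting with range_from: every entry with range_from <= value is scanned.
scanned_from = bisect_right(idx_from, (value, INF, INF))

# Index starting with depot: only that depot's entries with range_from <= value.
lo = bisect_left(idx_depot_from, (depot, -INF, -INF))
hi = bisect_right(idx_depot_from, (depot, value, INF))
scanned_depot = hi - lo

# The range_to condition is checked on the scanned entries.
matches = [entry for entry in idx_depot_from[lo:hi] if entry[2] >= value]

print(scanned_from, scanned_depot, matches)
```

Because the lookup value sits near the top of the range, the range_from-first index scans 9900 of the 10,000 entries, while the depot-first index scans 99.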

In either case, the query could be completed within the index. Without depot in the index, the query optimizer might decide to scan the table rather than the index. Some SQL dialects allow a hint to the optimizer indicating the number of rows the query will match. This can influence the resulting query plan.

While you consider the cardinality low, from an optimization standpoint it is quite high: it reduces the number of index rows to be scanned by a factor of 100.

The cardinality of range_from may be high from a values standpoint, but when running a range search it is of little value in finding records. For high values (which may be the most common case), it will match close to 100% of the records. Its high cardinality is of more use when running a range check of the form:

where range_from between $low and $high

The index may be more useful if the range columns are ordered range_to, range_from.

MySQL Performance – Special Function Slow in WHERE Clause but Not in SELECT

Here's the difference between the two statements:

The first statement calls the levenshtein function only on the retrieved result set.

The second statement calls the levenshtein and the length function as a so-called "predicate" for every row in the table in order to build the result set. Even if levenshtein is evaluated only when the first condition is true, there is still the overhead of the additional comparisons.

Thus, the second query is naturally slower.

IMHO this is a CPU-bound problem. To make this query faster, either supply more predicates that can be evaluated cheaply (that is, add more restrictions to the where clause), or add more CPU power and try parallel selects (if MySQL supports that).
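
The difference in work can be sketched in Python by counting how often the expensive function runs. The word list, the target, the distance limit, and the stand-in for "rows already selected" are hypothetical, since the original queries are not shown here. The cheap extra predicate relies on the fact that the edit distance between two strings is never smaller than the difference in their lengths.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

calls = {"n": 0}

def counted_levenshtein(a, b):
    calls["n"] += 1  # track how often the expensive function runs
    return levenshtein(a, b)

# Hypothetical table contents and search parameters.
words = ["kitten", "sitting", "mitten", "bitten",
         "internationalization", "a", "kitchen", "smitten"]
target, limit = "kitten", 2

# Function in the select list only: runs once per row already selected.
selected = words[:3]  # stand-in for rows chosen by some cheap condition
calls["n"] = 0
distances = [(w, counted_levenshtein(w, target)) for w in selected]
calls_in_select = calls["n"]

# Function as a predicate: runs for every row in the table.
calls["n"] = 0
full_scan = [w for w in words if counted_levenshtein(w, target) <= limit]
calls_in_where = calls["n"]

# Extra cheap predicate first: skip rows whose length already rules them out.
calls["n"] = 0
prefiltered = [w for w in words
               if abs(len(w) - len(target)) <= limit
               and counted_levenshtein(w, target) <= limit]
calls_prefiltered = calls["n"]

print(calls_in_select, calls_in_where, calls_prefiltered)
```

The prefiltered variant returns the same rows as the full scan with fewer calls to the expensive function; on a real table the saving depends on how selective the cheap predicate is.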
