MySQL – optimize a query with high- and low-cardinality indexes

mysql-5.6 optimization

I have a fairly large dataset and need a few joins to fetch the rows I want, using the following query:

SELECT DISTINCT c.c_id
FROM c_active z1
INNER JOIN cs c ON (z1.cv_id=c.cv_id)
INNER JOIN indi i ON (c.m_id=i.m_id)
INNER JOIN c_loc cl ON (z1.c_id=c.c_id)
INNER JOIN profs cp ON (z1.c_id=cp.c_id)
WHERE i.sex='2'
  AND c.lang='en'
  AND cl.is_country='0'
  AND cl.location_id IN (3,4,5,6)
  AND (cp.cat_id IN ('13', '2', '20'))

This is the execution plan produced by MySQL 5.6:

+----+-------------+-------+--------+------------------------+---------+---------+--------+----------+---------------------------------------------------------------------------+
| id | select_type | table | type   | key                    | key_len | ref     | rows   | filtered | Extra                                                                     |
+----+-------------+-------+--------+------------------------+---------+---------+--------+----------+---------------------------------------------------------------------------+
|  1 | SIMPLE      | i     | ref    | sex                    | 1       | const   | 306937 |   100.00 | Using index; Using temporary                                              |
|  1 | SIMPLE      | c     | ref    | m_id                   | 4       | i.m_id  |      1 |   100.00 | Using where                                                               |
|  1 | SIMPLE      | z1    | eq_ref | PRIMARY                | 4       | c.c_id  |      1 |   100.00 | Using index; Distinct                                                     |
|  1 | SIMPLE      | cp    | ref    | c_id                   | 4       | c.c_id  |      1 |   100.00 | Using where; Distinct                                                     |
|  1 | SIMPLE      | cl    | range  | is_country_location_id | 4       | NULL    | 936608 |   100.00 | Using where; Using index; Distinct; Using join buffer (Block Nested Loop) |
+----+-------------+-------+--------+------------------------+---------+---------+--------+----------+---------------------------------------------------------------------------+

Is there any way to make this query faster?
I see that the number of rows in the last line of the plan is huge, and I think this is why the query is slow.

I also see that the cardinality of the index on cl.c_id is high, while the cardinality of is_country_location_id is low:

+-------+------------+------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name               | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| c_loc |          0 | PRIMARY                |            1 | id          | A         |     1211146 |     NULL | NULL   |      | BTREE      |         |               |
| c_loc |          1 | cv_id                  |            1 | c_id        | A         |     1211146 |     NULL | NULL   |      | BTREE      |         |               |
| c_loc |          1 | is_country_location_id |            1 | is_country  | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| c_loc |          1 | is_country_location_id |            2 | location_id | A         |         574 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Maybe this is why MySQL goes through so many records.

What strategy could be used to optimize such a query?

Best Answer

To check exactly how selective your filter is for your particular query, you could execute the following queries:

SELECT 
   (SELECT count(*) FROM indi WHERE sex = '2')
   /
   (SELECT count(*) FROM indi)
   * 100
   as selectivity;

SELECT 
   (SELECT count(*) FROM c_loc WHERE location_id IN (3,4,5,6))
   /
   (SELECT count(*) FROM c_loc)
   * 100
   as selectivity;

The first query will probably return something around 50% in most applications. The second will give you around 1% if the locations are more or less equally distributed, but it can vary.
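Because the plan in the question uses the composite index is_country_location_id on c_loc, it may also be worth measuring both c_loc predicates together. The following is a sketch in the same style as the queries above, using only tables and columns that appear in the question:

```sql
-- Percentage of c_loc rows matching both predicates that the
-- composite index is_country_location_id can serve.
SELECT
   (SELECT count(*) FROM c_loc
     WHERE is_country = '0' AND location_id IN (3,4,5,6))
   /
   (SELECT count(*) FROM c_loc)
   * 100
   AS selectivity;
```

The measured count can then be compared with the optimizer's estimate for cl in the EXPLAIN output (936608 rows, against an index cardinality of about 1.2 million for the whole table).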

Having low cardinality is not a bad thing per se; the problem arises when you need to search on that column alone, efficiently, using an index. Low cardinality means that a full table scan may actually be faster. The threshold depends on many factors, and the MySQL optimizer is more complex than a simple percentage calculation, but here is a real example of when an index is no longer useful:

Full table scan vs. index

When the filter matches more than around 20% of the rows, the index tends to be no longer useful for filtering.
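One way to check where that threshold falls on your own data is to compare the plan, and then the actual run time, with and without the index. This is only a sketch: it assumes the index on indi.sex is named sex, as the key column of the EXPLAIN output in the question suggests.

```sql
-- Plan the optimizer chooses on its own.
EXPLAIN SELECT * FROM indi WHERE sex = '2';

-- Plan when the index is forced (index name `sex` taken from the EXPLAIN output).
EXPLAIN SELECT * FROM indi FORCE INDEX (sex) WHERE sex = '2';

-- Plan when the index is hidden, which leaves a full table scan.
EXPLAIN SELECT * FROM indi IGNORE INDEX (sex) WHERE sex = '2';
```

Running the last two statements without EXPLAIN and timing them shows which access path is actually faster for that value.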

There is no "real" solution that consists of modifying the query. For those cases, you may need an indexing method other than the standard BTREE algorithm. In most cases, you can provide a solution in MySQL using partitioning (even doing it manually, for example by storing males and females in different tables).
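A minimal sketch of the manual variant follows. The table name indi_sex_2 is hypothetical and does not exist in the question, and the application would have to keep the copy in sync with indi.

```sql
-- Hypothetical table holding only the rows with sex = '2'.
-- CREATE TABLE ... LIKE copies the column and index definitions of indi.
CREATE TABLE indi_sex_2 LIKE indi;

INSERT INTO indi_sex_2
SELECT * FROM indi WHERE sex = '2';

-- Queries that filtered on i.sex = '2' would then join against
-- indi_sex_2 instead of indi and drop that predicate.
SELECT count(*) FROM indi_sex_2;
```

Native partitioning of the table on the same column is the alternative, with the restriction that MySQL requires the partitioning column to be part of every unique key, including the primary key.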

In your particular case, the index is still being used, which means it may still be good enough and you may not need to do any additional work. I thought it was important, though, to understand the consequences of low selectivity and to play around with indexes to check whether you are getting an optimal query plan. However, if the final number of rows is relatively large, there is no good solution: selecting a lot of rows takes a lot of time. If your final result set is relatively small, try to force the join order so that the most selective clauses come first, by using STRAIGHT_JOIN, if the logic allows it. This is dangerous without proper monitoring, because cardinality may change over time, and selectivity can change for the same query just by using slightly different parameters.
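As an illustration, the query could be rewritten to start from c_loc, on the assumption that the location filter is the most selective one. Note that in the original query the ON clause for c_loc compares z1.c_id with c.c_id and never references cl; the sketch below assumes that cl.c_id = c.c_id was intended, so verify this against your schema before using it.

```sql
-- STRAIGHT_JOIN makes MySQL join the tables in the order they are listed.
SELECT DISTINCT STRAIGHT_JOIN c.c_id
FROM c_loc cl
INNER JOIN cs c        ON (cl.c_id = c.c_id)    -- ASSUMPTION: intended join condition for c_loc
INNER JOIN c_active z1 ON (z1.cv_id = c.cv_id)
INNER JOIN profs cp    ON (z1.c_id = cp.c_id)
INNER JOIN indi i      ON (c.m_id = i.m_id)
WHERE cl.is_country = '0'
  AND cl.location_id IN (3,4,5,6)
  AND c.lang = 'en'
  AND cp.cat_id IN ('13', '2', '20')
  AND i.sex = '2';
```

Compare the EXPLAIN output and the run time with those of the original query before keeping the hint, since a forced order stops the optimizer from adapting when the data distribution changes.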

Related Solutions

Mysql – Help optimizing MySQL slow query

I would like to get rid of "Using temporary; Using filesort"

One of the problems I see is that you're using different GROUP BY and ORDER BY clauses. From the manual on how MySQL uses temporary tables:

If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.

As soon as you create a temporary table, it will need to be sorted according to your ORDER BY clause, indicated by 'using filesort'.
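To illustrate the rule quoted above, here is a sketch on a hypothetical table hits with an index on page_id (neither name comes from the question). The Extra column of the two plans would typically differ as noted in the comments.

```sql
-- ORDER BY differs from GROUP BY: expect "Using temporary; Using filesort".
EXPLAIN
SELECT page_id, count(*) AS n
FROM hits
GROUP BY page_id
ORDER BY n DESC;

-- ORDER BY matches GROUP BY: with an index on page_id the rows can be
-- read in index order, so no temporary table or filesort should be needed.
EXPLAIN
SELECT page_id, count(*) AS n
FROM hits
GROUP BY page_id
ORDER BY page_id;
```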

This execution plan at least uses the indexes to appropriately limit the number of rows found.

I would also look through the docs on ORDER BY optimization.

SQL Server 2008 R2 – Unexpected Scans During Delete Operation Using WHERE IN

"I'm more wondering why the query optimizer would ever use the plan it currently does."

To put it another way, the question is why the following plan looks cheapest to the optimizer, compared with the alternatives (of which there are many).

Original Plan

The inner side of the join is essentially running a query of the following form for each correlated value of BrowserID:

DECLARE @BrowserID smallint;

SELECT 
    tfsph.BrowserID 
FROM dbo.tblFEStatsPaperHits AS tfsph 
WHERE 
    tfsph.BrowserID = @BrowserID 
OPTION (MAXDOP 1);

Paper Hits Scan

Note that the estimated number of rows is 185,220 (not 289,013) since the equality comparison implicitly excludes NULL (unless ANSI_NULLS is OFF). The estimated cost of the above plan is 206.8 units.

Now let's add a TOP (1) clause:

DECLARE @BrowserID smallint;

SELECT TOP (1)
    tfsph.BrowserID 
FROM dbo.tblFEStatsPaperHits AS tfsph 
WHERE 
    tfsph.BrowserID = @BrowserID 
OPTION (MAXDOP 1);

With TOP (1)

The estimated cost is now 0.00452 units. The addition of the Top physical operator sets a row goal of 1 row at the Top operator. The question then becomes how to derive a 'row goal' for the Clustered Index Scan; that is, how many rows should the scan expect to process before one row matches the BrowserID predicate?

The statistical information available shows 166 distinct BrowserID values (1/[All Density] = 1/0.006024096 = 166). Costing assumes that the distinct values are distributed uniformly over the physical rows, so the row goal on the Clustered Index Scan is set to 166.302 (accounting for the change in table cardinality since the sampled statistics were gathered).
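The density figure can be inspected and the arithmetic reproduced as sketched below. The statistics name is a placeholder, since the question does not say which statistics object covers BrowserID.

```sql
-- Placeholder: replace stats_or_index_name with the statistics (or index)
-- that covers BrowserID on this table.
DBCC SHOW_STATISTICS ('dbo.tblFEStatsPaperHits', stats_or_index_name)
    WITH DENSITY_VECTOR;

-- 1 / [All Density] gives the number of distinct BrowserID values
-- (about 166 for the density quoted above).
SELECT 1.0 / 0.006024096 AS DistinctBrowserIDs;
```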

The estimated cost of scanning the expected 166 rows is not very large (even executed 339 times, once for each change of BrowserID): the Clustered Index Scan shows an estimated cost of 1.3219 units, showing the scaling effect of the row goal. The unscaled operator costs for I/O and CPU are shown as 153.931 and 52.8698, respectively:

Row Goal Scaled Estimated Costs

In practice, it is very unlikely that the first 166 rows scanned from the index (in whatever order they happen to be returned) will contain one each of the possible BrowserID values. Nevertheless, the DELETE plan is costed at 1.40921 units total, and is selected by the optimizer for that reason. Bart Duncan shows another example of this type in a recent post titled Row Goals Gone Rogue.

It is also interesting to note that the Top operator in the execution plan is not associated with the Anti Semi Join (in particular the 'short-circuiting' Martin mentions). We can start to see where the Top comes from by first disabling an exploration rule called GbAggToConstScanOrTop:

DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers 
WHERE BrowserID NOT IN 
(
    SELECT DISTINCT BrowserID 
    FROM tblFEStatsPaperHits WITH (NOLOCK) 
    WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, LOOP JOIN, RECOMPILE);
GO
DBCC RULEON ('GbAggToConstScanOrTop');

GbAggToConstScanOrTop Disabled

That plan has an estimated cost of 364.912, and shows that the Top replaced a Group By Aggregate (grouping by the correlated column BrowserID). The aggregate is not due to the redundant DISTINCT in the query text: it is an optimization that can be introduced by two exploration rules, LASJNtoLASJNonDist and LASJOnLclDist. Disabling those two as well produces this plan:

DBCC RULEOFF ('LASJNtoLASJNonDist');
DBCC RULEOFF ('LASJOnLclDist');
DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers 
WHERE BrowserID NOT IN 
(
    SELECT DISTINCT BrowserID 
    FROM tblFEStatsPaperHits WITH (NOLOCK) 
    WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, LOOP JOIN, RECOMPILE);
GO
DBCC RULEON ('LASJNtoLASJNonDist');
DBCC RULEON ('LASJOnLclDist');
DBCC RULEON ('GbAggToConstScanOrTop');

Spool Plan

That plan has an estimated cost of 40729.3 units.

Without the transformation from Group By to Top, the optimizer 'naturally' chooses a hash join plan with BrowserID aggregation before the anti semi join:

DBCC RULEOFF ('GbAggToConstScanOrTop');
GO
DELETE FROM tblFEStatsBrowsers 
WHERE BrowserID NOT IN 
(
    SELECT DISTINCT BrowserID 
    FROM tblFEStatsPaperHits WITH (NOLOCK) 
    WHERE BrowserID IS NOT NULL
) OPTION (MAXDOP 1, RECOMPILE);
GO
DBCC RULEON ('GbAggToConstScanOrTop');

No Top DOP 1 Plan

And without the MAXDOP 1 restriction, a parallel plan:

No Top Parallel Plan

Another way to 'fix' the original query would be to create the missing index on BrowserID that the execution plan reports. Nested loops work best when the inner side is indexed. Estimating cardinality for semi joins is challenging at the best of times. Not having proper indexing (the large table doesn't even have a unique key!) will not help at all.
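A minimal sketch of that index follows. The index name is hypothetical; the table and column are the ones used in the queries above.

```sql
-- Hypothetical index name; gives the inner side of the nested loops join
-- a seek on BrowserID instead of a scan.
CREATE NONCLUSTERED INDEX IX_tblFEStatsPaperHits_BrowserID
    ON dbo.tblFEStatsPaperHits (BrowserID);
```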

I wrote more about this in Row Goals, Part 4: The Anti Join Anti Pattern.
