How to Optimize Slow MySQL Query with Index

index-tuningmemoryMySQLoptimizationperformancequery-performance

I've a slow query (~11 seconds) that I'm trying to optimize. But the use of indexes seems to not make any difference even thou Explain indicates otherwise. Let me walk you through my problem.

So this is my query:

SELECT  SQL_NO_CACHE posts.id, posts.p_id, posts.type, posts.message,
       posts.eng_rate, posts.created_time, posts.mention_or_post
    FROM  posts
    JOIN  stats ON stats.TW_ID = posts.TW_ID
    WHERE  stats.category = 'hotel'
      AND  posts.mention_or_post = 'tweet'
      AND  posts.eng_rate > 1
      AND  posts.in_reply_to = 0
      AND  posts.created_time > 1453161930
    ORDER BY  eng_rate DESC
    LIMIT  0, 1000; 

Running a profile on the query (same result with forcing to use Index in statement) gives the results:

+----------------------+-----------+
| Status               | Duration  |
+----------------------+-----------+
| starting             |  0.000063 |
| checking permissions |  0.000091 |
| checking permissions |  0.000803 |
| Opening tables       |  0.000189 |
| System lock          |  0.000052 |
| init                 |  0.000155 |
| optimizing           |  0.000099 |
| statistics           |  0.001616 |
| preparing            |  0.000059 |
| executing            |  0.000046 |
| Sorting result       |  0.000051 |
| Sending data         | 11.443096 |
| end                  |  0.000471 |
| query end            |  0.000278 |
| closing tables       |  0.001001 |
| freeing items        |  0.001249 |
| logging slow query   |  0.000178 |
| logging slow query   |  0.000853 |
| cleaning up          |  0.000245 |
+----------------------+-----------+

MySQL slow query logs shows:

# Query_time: 11.449220  Lock_time: 0.001186 Rows_sent: 54  Rows_examined: 340139
SET timestamp=1455881044;
SELECT  SQL_NO_CACHE posts.id, posts.p_id, posts.type, posts.message,
        posts.eng_rate, posts.created_time, posts.mention_or_post
    FROM  posts
    JOIN  stats ON stats.TW_ID = posts.TW_ID
    WHERE  stats.category = 'hotel'
      AND  posts.mention_or_post = 'tweet'
      AND  posts.eng_rate > 1
      AND  posts.in_reply_to = 0
      AND  posts.created_time > 1453161930
    ORDER BY  eng_rate DESC
    LIMIT  0, 1000; 

Notice the rows examined, 340139

Explain on the query:

+----+-------------+----------+-------+------------------------------------+---------------+---------+----------------------------+-------+-------------+
| id | select_type | table    | type  | possible_keys                      | key           | key_len | ref                        | rows  | Extra       |
+----+-------------+----------+-------+------------------------------------+---------------+---------+----------------------------+-------+-------------+
|  1 | SIMPLE      | posts    | range | c_id,mention_or_post,scraper_index | scraper_index | 36      | NULL                       | 69801 | Using where |
|  1 | SIMPLE      | stats    | ref   | tw_id                              | tw_id         | 8       | bacon.posts.tw_id          |     1 | Using where |
+----+-------------+----------+-------+------------------------------------+---------------+---------+----------------------------+-------+-------------+

Rows 69801 with the "scraper_index". OK so it seems Explain is estimating and the real results differs heavily from the real rows examined 340139, probably due to the joining part in the query.

How is the index structured:

+----------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+--------------------------------------------------------------------------------+
| Table    | Non_unique | Key_name        | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment                                                                  |
+----------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+--------------------------------------------------------------------------------+
| posts    |          0 | PRIMARY         |            1 | p_id            | A         |     4276539 |     NULL | NULL   |      | BTREE      |         |                                                                                |
| posts    |          1 | c_id            |            1 | tw_id           | A         |        1478 |     NULL | NULL   |      | BTREE      |         |                                                                                |
| posts    |          1 | mention_or_post |            1 | mention_or_post | A         |           3 |     NULL | NULL   |      | BTREE      |         |                                                                                |
| posts    |          1 | scraper_index   |            1 | in_reply_to     | A         |      610934 |     NULL | NULL   | YES  | BTREE      |         |                                                                                |
| posts    |          1 | scraper_index   |            2 | mention_or_post | A         |      610934 |     NULL | NULL   |      | BTREE      |         |                                                                                |
| posts    |          1 | scraper_index   |            3 | eng_rate        | A         |      610934 |     NULL | NULL   |      | BTREE      |         |                                                                                |
+----------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+--------------------------------------------------------------------------------+

The scraper_index is built upon and ordered for aiming to get the smallest row results when running Explain on the query. So I've tried with different columns and order to get the best result with Explain. (E.g. created_time has the same carnality as number of rows in the table)

So even with the index the query is slow. But what if I force my query to not use any indexes at all (for table posts)? The result:

# Query_time: 12.959047  Lock_time: 0.010025 Rows_sent: 54  Rows_examined: 4385473
SET timestamp=1455883033;
SELECT  SQL_NO_CACHE posts.id, posts.p_id, posts.type, posts.message,
        posts.eng_rate, posts.created_time, posts.mention_or_post
    FROM  posts IGNORE INDEX (scraper_index, mention_or_post)
    JOIN  stats ON stats.TW_ID = posts.TW_ID
    WHERE  stats.category = 'hotel'
      AND  posts.mention_or_post = 'tweet'
      AND  posts.eng_rate > 1
      AND  posts.in_reply_to = 0
      AND  posts.created_time > 1453161930
    ORDER BY  eng_rate DESC
    LIMIT  0, 1000; 

A bit slower query time but with now with row examined of 4385473.
To me this tells me that the index 'scraper_index' is so inefficient that it's nearly as slow as not having a index at all.

Posts-table short status: MyISAM engine, 4276539 rows, 1Gb data.
Stats-table short status: MyISAM engine, 1506 rows, 32Mb data.

Table structure for posts:

+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| tw_id           | bigint(20)  | NO   | MUL | NULL    |       |
| p_id            | varchar(50) | NO   | PRI | NULL    |       |
| created_time    | int(11)     | NO   |     | NULL    |       |
| message         | text        | NO   |     | NULL    |       |
| type            | varchar(10) | YES  |     | NULL    |       |
| expanded_url    | text        | YES  |     | NULL    |       |
| eng_rate        | int(11)     | NO   |     | NULL    |       |
| in_reply_to     | bigint(20)  | YES  | MUL | NULL    |       |
| user_id         | varchar(20) | YES  |     | NULL    |       |
| mention_or_post | varchar(7)  | NO   | MUL | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+

Is my index build wrong for the query?

– Update 19/2

Changes to 'scraper_index' based from the community's suggestions. Replaced eng_rate with created_time. Improved query time with ~2-3 seconds. The index now looks like this:

+----------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+--------------------------------------------------------------------------------+
| Table    | Non_unique | Key_name        | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment                                                                  |
+----------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+--------------------------------------------------------------------------------+
| posts    |          0 | PRIMARY         |            1 | p_id            | A         |     4276539 |     NULL | NULL   |      | BTREE      |         |                                                                                |
| posts    |          1 | c_id            |            1 | tw_id           | A         |        1478 |     NULL | NULL   |      | BTREE      |         |                                                                                |
| posts    |          1 | mention_or_post |            1 | mention_or_post | A         |           3 |     NULL | NULL   |      | BTREE      |         |                                                                                |
| posts    |          1 | scraper_index   |            1 | in_reply_to     | A         |      610934 |     NULL | NULL   | YES  | BTREE      |         |     |
| posts    |          1 | scraper_index   |            2 | mention_or_post | A         |      610934 |     NULL | NULL   |      | BTREE      |         |     |
| posts    |          1 | scraper_index   |            3 | created_time    | A         |     4276539 |     NULL | NULL   |      | BTREE      |         |    
+----------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+--------------------------------------------------------------------------------+

Slow query log shows:

# Query_time: 8.945486  Lock_time: 0.000261 Rows_sent: 54  Rows_examined: 202192
SET timestamp=1455891107;
SELECT  SQL_NO_CACHE posts.id, posts.p_id, posts.type, posts.message,
        posts.eng_rate, posts.created_time, posts.mention_or_post
    FROM  posts
    JOIN  stats ON stats.TW_ID = posts.TW_ID
    WHERE  stats.category = 'hotel'
      AND  posts.mention_or_post = 'tweet'
      AND  posts.eng_rate > 1
      AND  posts.in_reply_to = 0
      AND  posts.created_time > 1453161930
    ORDER BY  eng_rate DESC; 

Explain with the new index:

+----+-------------+----------+-------+------------------------------------+---------------+---------+----------------------------+-------+-----------------------------+
| id | select_type | table    | type  | possible_keys                      | key           | key_len | ref                        | rows  | Extra                       |
+----+-------------+----------+-------+------------------------------------+---------------+---------+----------------------------+-------+-----------------------------+
|  1 | SIMPLE      | posts    | range | c_id,mention_or_post,scraper_index | scraper_index | 36      | NULL                       | 40152 | Using where; Using filesort |
|  1 | SIMPLE      | stats    | ref   | tw_id                              | tw_id         | 8       | bacon.posts.tw_id          |     1 | Using where                 |
+----+-------------+----------+-------+------------------------------------+---------------+---------+----------------------------+-------+-----------------------------+

– end of update

– Update 22/2

Tonight's run with the specific problem query above in the MySQL slow query log shows that the problem still persists and the query time is high, 9-12 seconds. However no significant improvements was expected from earlier changes.

Today when I was profiling the query I got unexpected fast query results (less than 0.5 seconds) which lead me to suspect it had to do something with the cache even if I've manually done the query with SQL_NO_CACHE all time. Maybe it was some sort of disk cache, I don't know but I couldn't reproduce slower query time as before so I've to see how the changes will work in real situation next run.

Modifying and creating new indexes to Rick James suggestion. So the modified index looks like this:

posts:  INDEX(mention_or_post, in_reply_to, created_time) -- Named scraper_index
stats:  INDEX(TW_ID, category) -- Named tw_id

And the new index:

stats:  INDEX(category, TW_ID)  -- Named test2
posts:  INDEX(TW_ID, mention_or_post, in_reply_to, created_time) -- Named test1

The query optimizer is choosing the new indexes "test2" and "test1" as prefered indexes to use. Explain tells:

+----+-------------+----------+------+------------------------------------------+-------+---------+----------------------------------------+------+-----------------------------------------------------------+
| id | select_type | table    | type | possible_keys                            | key   | key_len | ref                                    | rows | Extra                                                     |
+----+-------------+----------+------+------------------------------------------+-------+---------+----------------------------------------+------+-----------------------------------------------------------+
|  1 | SIMPLE      | stats    | ref  | tw_id,test2                              | test2 | 92      | const                                  |   44 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | posts    | ref  | c_id,scraper_index,mention_or_post,test1 | test1 | 40      | bacon.stats.TW_ID,const,const          |    7 | Using where                                               |
+----+-------------+----------+------+------------------------------------------+-------+---------+----------------------------------------+------+-----------------------------------------------------------+

That looks very promising.

Running the query first time:

+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000080 |
| checking permissions | 0.000004 |
| checking permissions | 0.000004 |
| Opening tables       | 0.000020 |
| System lock          | 0.000010 |
| init                 | 0.000039 |
| optimizing           | 0.000018 |
| statistics           | 0.000421 |
| preparing            | 0.000119 |
| Creating tmp table   | 0.006901 |
| executing            | 0.000006 |
| Copying to tmp table | 0.008742 |
| Sorting result       | 0.000040 |
| Sending data         | 0.000138 |
| end                  | 0.000005 |
| removing tmp table   | 0.000093 |
| end                  | 0.000004 |
| query end            | 0.000003 |
| closing tables       | 0.000009 |
| freeing items        | 0.000323 |
| logging slow query   | 0.000003 |
| logging slow query   | 0.000025 |
| cleaning up          | 0.000003 |
+----------------------+----------+

# Query_time: 0.016658  Lock_time: 0.000108 Rows_sent: 62  Rows_examined: 2933
SET timestamp=1456160782;
SELECT SQL_NO_CACHE posts.id, posts.p_id, posts.type, posts.message, posts.eng_rate, posts.created_time, posts.mention_or_post
FROM posts
JOIN stats
ON stats.TW_ID = posts.TW_ID
WHERE stats.category = 'hotel'
AND posts.mention_or_post = 'tweet'
AND posts.eng_rate > 1
AND posts.in_reply_to = 0
AND posts.created_time > 1453161930
ORDER BY eng_rate DESC;

Running the same query again

+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000075 |
| checking permissions | 0.000005 |
| checking permissions | 0.000004 |
| Opening tables       | 0.000021 |
| System lock          | 0.000009 |
| init                 | 0.000038 |
| optimizing           | 0.000018 |
| statistics           | 0.000237 |
| preparing            | 0.000018 |
| Creating tmp table   | 0.000158 |
| executing            | 0.000003 |
| Copying to tmp table | 0.008009 |
| Sorting result       | 0.000040 |
| Sending data         | 0.000108 |
| end                  | 0.000002 |
| removing tmp table   | 0.000131 |
| end                  | 0.000004 |
| query end            | 0.000005 |
| closing tables       | 0.000009 |
| freeing items        | 0.000014 |
| logging slow query   | 0.000002 |
| logging slow query   | 0.000276 |
| cleaning up          | 0.000003 |
+----------------------+----------+

# Query_time: 0.008897  Lock_time: 0.000102 Rows_sent: 62  Rows_examined: 2933
SET timestamp=1456160909;
SELECT SQL_NO_CACHE posts.id, posts.p_id, posts.type, posts.message, posts.eng_rate, posts.created_time, posts.mention_or_post
FROM posts
JOIN stats
ON stats.TW_ID = posts.TW_ID
WHERE stats.category = 'hotel'
AND posts.mention_or_post = 'tweet'
AND posts.eng_rate > 1
AND posts.in_reply_to = 0
AND posts.created_time > 1453161930
ORDER BY eng_rate DESC;

So if there are no cache bullshitting me, this really looks promising. I've to check tomorrow what the "real" affects are.

– end of update


Maybe my MYSQL-server memory configuration isn't enough for this query. My config in short:

key_buffer_size         = 64M
table_open_cache        = 512
sort_buffer_size        = 5M
read_buffer_size        = 5M
read_rnd_buffer_size    = 512K
net_buffer_length       = 8K
myisam_sort_buffer_size = 8M
join_buffer_size        = 1M
thread_cache_size       = 4
max_heap_table_size     = 50M
tmp_table_size          = 50M
query_cache_limit       = 15M
query_cache_size        = 60M

Is the memory limit a bottleneck for this query?
A result of 340K examined rows out of 4,3M potential rows – How much memory is needed to have the index as efficient as possible?

Does anyone have some good suggestions what to do next?

Partitioning the table "posts" for less rows may be a solution to improve the query time and it's possible for me to partition the table but in the feature the "fresh" data table may contain millions of rows anyhow so then it's only a temporary solution which is not desirable.

Best Answer

In case it is better to start with stats:

stats:  INDEX(category, TW_ID)  -- "covering"
posts:  INDEX(TW_ID, mention_or_post, in_reply_to, created_time)

In case create_time is a better filter than eng_rate:

posts:  INDEX(mention_or_post, in_reply_to, created_time)
stats:  INDEX(TW_ID, category)  -- "covering"

For posts, start with the columns compared by '=' (in any order), then move on to one "range" -- either eng_rate or created_time

"Flags" (mention_or_post) are almost never worth indexing by themselves. If it is a flag, use either TINYINT or ENUM, not a bulky VARCHAR(7).

"To me this tells me that the index 'scraper_index' is so inefficient that it's nearly as slow as not having a index at all." -- Not necessarily. It could be that a bunch of index and/or data blocks were cached, thereby invalidating the conclusion. Run each query twice and take the second timing. (And either have the Query cache turned off, or include SQL_NO_CACHE.)

(Please provide SHOW CREATE TABLE, it is more descriptive than DESCRIBE.)

More on creating indexes.

What the heck is in stats? "1506 rows, 32Mb data" implies a lot of stuff.

When will you be moving to InnoDB.