MySQL query not using an index when table contains many records

indexMySQL

I have a table:

CREATE TABLE `p` (  
    `id` BIGINT(20) unsigned NOT NULL,  
    `rtime` DATETIME NOT NULL,  
    `d` INT(10) NOT NULL,  
    `n` INT(10) NOT NULL,  
    PRIMARY KEY (`rtime`,`id`,`d`) USING BTREE  
) ENGINE=MyISAM DEFAULT CHARSET=latin1;  

A query that I run against this table is:

SELECT id, d, SUM(n) 
FROM p 
WHERE rtime BETWEEN '2012-08-25' AND DATE(now()) 
GROUP BY id, d;

I'm running explain on this query on a tiny table (2 records) and it tells me it's going to use my primary key index:

id  | select_type  | table | type   | possible_keys | key     | key_len | ref  | rows | Extra
1   | SIMPLE       | p     | range  | PRIMARY       | PRIMARY | 8       | NULL | 1    | Using where; Using temporary; Using filesort

When I run this query on the same table with 350 million records – it prefers to go through all the records and ignore my keys:

id  | select_type  | table  | type | possible_keys  | key  | key_len | ref  | rows      | Extra
1   | SIMPLE       | p      | ALL  | PRIMARY        | NULL | NULL    | NULL | 355465280 | Using where; Using temporary; Using filesort

Obviously, this is extremely slow.

I've tried losing the GROUP BY and even changed the BETWEEN to a simple '>', but it still won't use the key.

The only time I got it to use the key was when I used 'rtime = ..'.

I should note that all this data is actually only going about a week back, so when I try to get rtime > 3-days-ago I'm expecting a significant chunk from those 350 million records.

This query is supposed to run every 15 minutes, so the current 30-40 minute execution time is definitely unacceptable.

How can I construct the query such that it uses the indexes, or how should I index the table in order to get the fastest performance?

Best Answer

The table size is not the villain. It's the estimated number of rows.

The query optimizer, in this case (MyISAM, key starting with rtime, etc), will do something like this:

  1. estimate the percentage of the table to scan, based on "WHERE rtime BETWEEN..."
  2. If that is "small" (say, less than 20%, but that is not a hard number), use the INDEX; else do a table scan.

Step 1 depends on the "statistics" that are kept with the MyISAM table. The stats are usually pretty accurate, but they can become less accurate. ANALYZE TABLE is the fix for that. (I don't think I have ever seen a need for ANALYZE being run more than monthly; usually it is not needed at all.)

The reason for the to-INDEX-or-not-to-INDEX question goes something like this... When using the INDEX, the execution has to bounce between index 'rows' and data rows. The data rows are potentially randomly scattered, leading to (potentially) lots of I/O. Hence, doing a 'table scan' is preferred after some point.