Mysql – Can MySQL use an index to calculate SUM


The following query is run in a MySQL database (all InnoDB tables, Percona 5.5), and is slow for some parameter values (i.e. for some large customers). It returns the total size of all files uploaded by a given customer:

SELECT SUM(document.size)
  FROM document
       INNER JOIN post ON post.id = document.id
       INNER JOIN page ON page.id = post.page_id
 WHERE page.customer_id=1234;

I tried denormalization: I added a document.customer_id column and an index on document(customer_id, size), and got much better query times. But I'd like to avoid denormalization if possible. I also tried adding just an index on document(size), but couldn't get MySQL to use it. The documentation states that every secondary index in an InnoDB table also contains a reference to the primary key, so why did MySQL not use this new index for the join and save itself the need to load the document row data?
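For reference, the denormalization described above might look like the following (the index name is illustrative, not from the original schema):

```sql
-- Duplicate page.customer_id onto document and index it together
-- with size, so the sum can be answered from the index alone.
ALTER TABLE document
  ADD COLUMN customer_id INT NOT NULL,
  ADD INDEX ix_document_customer_size (customer_id, size);

-- The total then becomes a single covered index range scan,
-- with no joins at all:
SELECT SUM(size) FROM document WHERE customer_id = 1234;
```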

The schema looks like:

CREATE TABLE page (
  id INT NOT NULL,
  customer_id INT NOT NULL,
  ...
  PRIMARY KEY (id),
  INDEX ix_page_customer (customer_id),
  FOREIGN KEY (customer_id) REFERENCES customer (id)
);

CREATE TABLE post (
  id INT NOT NULL,
  page_id INT NOT NULL,
  ...
  PRIMARY KEY (id),
  INDEX ix_post_page (page_id),
  FOREIGN KEY (page_id) REFERENCES page (id)
);

CREATE TABLE document (
  id INT NOT NULL,
  size INT NOT NULL,
  ...
  PRIMARY KEY (id),
  FOREIGN KEY (id) REFERENCES post (id)
);

A customer contains many pages, a page contains many posts. Posts can be of different types (e.g. message, document). There are tables for each of these post subtypes, of which document is one. One of the larger customers has 435K posts (of which 8K are documents) spread over 25K pages.

The query is fairly fast for smaller customers (i.e. lower pages*posts), but slow for larger customers. Here's the EXPLAIN output for the query for a large customer:

+----+-------------+-----------+--------+--------------------------+------------------+---------+-----------------------+-------+-------------+
| id | select_type | table     | type   | possible_keys            | key              | key_len | ref                   | rows  | Extra       |
+----+-------------+-----------+--------+--------------------------+------------------+---------+-----------------------+-------+-------------+
|  1 | SIMPLE      | page      | ref    | PRIMARY,ix_page_customer | ix_page_customer | 4       | const                 | 51200 | Using index |
|  1 | SIMPLE      | post      | ref    | PRIMARY,ix_post_page     | ix_post_page     | 4       | database.page.page_id |     2 | Using index |
|  1 | SIMPLE      | document  | eq_ref | PRIMARY                  | PRIMARY          | 4       | database.post.post_id |     1 |             |
+----+-------------+-----------+--------+--------------------------+------------------+---------+-----------------------+-------+-------------+

Best Answer

Please note that Extra: Using index doesn't mean that an index is used to find rows; you already know that from the columns type: eq_ref and key: PRIMARY. Using index actually means that the covering index optimisation is being used to access the data.

In your case, an index is used for all 3 table accesses. If you also wanted the covering index optimisation on document, you could create an index on (document.id, document.size), but unless that table is very large and this query is very frequent (in cases where you are disk-bound), you would not see a big difference. The index on (document.size) does, effectively, contain the primary key, but it cannot be used here because the primary key column is hidden after the size column; to be usable for filtering, it would have to come first. Also, in MySQL/Percona 5.5 the hidden primary key can only be used for certain sorting and covering index operations, not for filtering. This last limitation is lifted in MySQL/Percona 5.6.
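The covering index suggested above could be created like this (the index name is illustrative):

```sql
-- (id, size) covers both the join lookup on the primary key and
-- the summed column, so InnoDB can satisfy the query from the
-- secondary index without visiting the clustered index rows.
ALTER TABLE document ADD INDEX ix_document_id_size (id, size);
```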

Your biggest problem is that the query has to read ~50K rows and add them up, and no index will eliminate that work. With the above tricks you can make the computation happen mostly in memory (assuming a big enough buffer pool), but you still have to sum every single value. You could maintain a duplicated customer_id on document, if that makes sense (that is denormalizing, too), to avoid 2 of the 3 table accesses.

But your initial idea of denormalizing, keeping the value precalculated (which can be maintained with a trigger), or using some other kind of cache (the query cache, with all its problems, or an external cache like memcached, etc.) is probably the way to go if you cannot tolerate such latencies.
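A trigger-maintained precalculated total could be sketched roughly as follows (the summary table and trigger names are hypothetical; matching UPDATE and DELETE triggers, plus an initial backfill, would also be needed):

```sql
-- One precomputed running total per customer.
CREATE TABLE customer_document_total (
  customer_id INT NOT NULL PRIMARY KEY,
  total_size  BIGINT NOT NULL DEFAULT 0
);

DELIMITER //
-- Keep the total in sync whenever a new document row is inserted.
CREATE TRIGGER trg_document_ai
AFTER INSERT ON document
FOR EACH ROW
BEGIN
  UPDATE customer_document_total cdt
    JOIN post ON post.id = NEW.id
    JOIN page ON page.id = post.page_id
     SET cdt.total_size = cdt.total_size + NEW.size
   WHERE cdt.customer_id = page.customer_id;
END//
DELIMITER ;

-- Reading the total is then a single primary-key lookup:
SELECT total_size FROM customer_document_total WHERE customer_id = 1234;
```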