MySQL join two large tables is very slow

join;MySQLperformancequery-performance

I have two tables, one of them contains a history of downloaded urls, while the other table contains details about each url.

The following query groups the URLs by the number of repetitions during the past hour.

SELECT COUNT(history.url) as total, history.url
FROM history
WHERE history.time > UNIX_TIMESTAMP()-3600
GROUP BY history.url
ORDER BY COUNT(history.url) DESC 
LIMIT 30

The query above takes about 800ms to execute, not fast enough, but acceptable,

however, when joining it with the cache table, the new query takes about 25s to execute, which is very slow.

SELECT th.total, th.url, tc.url, tc.json 
FROM (SELECT COUNT(history.url) as total, history.url
      FROM history 
      WHERE history.time > UNIX_TIMESTAMP()-3600
      GROUP BY history.url
      ORDER BY COUNT(history.url) DESC 
      LIMIT 30
) th
INNER JOIN (SELECT cache.url, cache.json FROM cache) tc
    ON th.url = tc.url
GROUP BY th.url
ORDER BY th.total DESC
LIMIT 30

I think that this might be happening because in 'tc', the whole cache table is being loaded, and it contains 1million+ entries.

When I use the first query, and then programmatically iterate the results and then run a SELECT query from cache for each result, it is much faster. Is there anyway to speed up my second query?

P.S. I'm using InnoDB

UPDATE
The output of the second query with EXPLAIN
enter image description here

The structure of 'history' table
enter image description here

The structure of 'cache' table
enter image description here

Best Answer

It is generally a good idea to index columns that participate in JOIN predicates or WHERE clauses. A common mistake is to create several one column indexes instead of fewer multi-column indexes. Here we can benefit from both url and time in history (look at all queries against these tables and you may find that you can add additional columns to these indexes):

CREATE INDEX x01_history_url ON HISTORY (URL, TIME);
CREATE INDEX x01_cache_url ON CACHE (URL);

Second, try to unnest the queries. MySQL has limitations on what kind of query rewrites it is able to do, so the nesting may cause unnecessary overhead.

SELECT COUNT(th.url) as total, tc.url, tc.json 
FROM history th
JOIN cache tc
    ON th.url = tc.url
WHERE th.time > UNIX_TIMESTAMP()-3600
GROUP BY tc.url, tc.json
ORDER BY COUNT(th.url) DESC 
LIMIT 30

Note that this query is semantically different from your query so you might get a different result. If that is a problem you may want to keep the LIMIT 30 construction in a subquery as before. You may also consider if you can add a similar limit to CACHE, is there an upper bound on how many CACHE rows you must investigate to get 30 rows in total?

INNER JOIN (SELECT cache.url, cache.json 
            FROM cache
            ORDER BY ? LIMIT ?) tc