I'm wondering if someone could help me optimize my tables/query to speed up a query. It is currently running ridiculously slow. I think a well-thought out index could help me. Any help would be really appreciated
Tables URLS and TAGS mentioned below are 2 and 20 million rows respectively (will probably end up having 10x). A query like the one below already takes 10 seconds to run.
An Example:
http://whatrethebest.com/php+tutorials
Tables
CREATE TABLE IF NOT EXISTS `TAGS` (
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `my_unique_key` (`hash`,`tag`),
KEY `tag` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and
CREATE TABLE IF NOT EXISTS `URLS` (
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`md5`),
UNIQUE KEY `md5` (`md5`),
KEY `numsaves` (`numsaves`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
QUERY
SELECT urls.md5, urls.url, urls.title, urls.numsaves
FROM urls
JOIN tags ON urls.md5 = tags.hash
WHERE tags.tag
IN (
'php', 'tutorials'
)
GROUP BY urls.md5
HAVING COUNT( * ) =2
ORDER BY urls.numsaves DESC
LIMIT 20
EXPLAIN
I'm not sure what this shows
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tags range my_unique_key,tag tag 767 NULL 230946 Using where; Using index; Using temporary; Using filesort
1 SIMPLE urls eq_ref PRIMARY,md5 PRIMARY 767 jcooper_whatrethebest_urls.tags.hash 1
So I think the problem is:
certain tags like 'php have 34,000 entries, most of which only have under 5 saves. But in order to get the 20 most saved it is having to sort them all.Right?
I can't really create a 'numsaves' column in TAGS and index on that because that number will be changing up and down, and that wouldnt make sense. Is it possible to create a cross-table index between urls.numsaves and tags.tag? Or a third table to use in my query somehow? Would this solve my problem? I know almost nothing about indexing.
Any help would be really appreciated!
EDITS BELOW
RESPONSE TO YperCube:
Thank you, Thank you, your suggestions have sped up my queries by a factor of 10-20X .
This is an immense improvement. I can't thank you enough.
I'm posting my current queries and tables with execution times in case you or anyone else has any more optimization suggestions. I am worried that as my table grows I may not be able to keep my search times under 3 seconds, which would be a killer.
New Query Example 1
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = u.id
AND t2.tag = 'language'
ORDER BY u.numsaves DESC
LIMIT 20
Showing rows 20 - 19 ( 20 total, Query took 0.2009 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t2 ref tag_id tag_id 767 const 53820 Using where; Using index; Using temporary; Using filesort
1 SIMPLE t1 ref tag_id tag_id 772 const,jcooper_whatrethebest_urls.t2.url_id 1 Using where; Using index
1 SIMPLE u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t2.url_id 1
Neq Query Example 2 (seems to be slower)
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN
( SELECT ui.id, ui.numsaves
FROM urls AS ui
JOIN tags AS t1 ON t1.url_id = ui.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = ui.id
AND t2.tag = 'language'
ORDER BY ui.numsaves DESC
LIMIT 20
) AS ulim ON ulim.id = u.id
ORDER BY ulim.numsaves DESC ;
Showing rows 0 - 29 ( 2,794,577 total, Query took 0.4633 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 20 Using filesort
1 PRIMARY u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 ulim.id 1
2 DERIVED t2 ref tag_id tag_id 767 53820 Using where; Using index; Using temporary; Using filesort
2 DERIVED t1 ref tag_id tag_id 772 jcooper_whatrethebest_urls.t2.url_id 1 Using where; Using index
2 DERIVED ui eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t2.url_id 1
Using Query Example on a Single Tag (slower by a lot)
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
ORDER BY u.numsaves DESC
LIMIT 20
Showing rows 20 - 19 ( 20 total, Query took 3.7395 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 ref tag_id tag_id 767 const 200576 Using where; Using index; Using temporary; Using filesort
1 SIMPLE u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t1.url_id 1
I'm not sure why this one is so much slower?
Do you have any ideas of a query to optimize for querying a single tag?
My Current Tables
CREATE TABLE `urls` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
KEY `id_numsaves_IX` (`id`,`numsaves`)
) ENGINE=InnoDB AUTO_INCREMENT=2958560 DEFAULT CHARSET=utf8
CREATE TABLE `tags` (
`url_id` int(11) DEFAULT NULL,
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `tag_id` (`tag`,`url_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Thank you again
Best Answer
Add an index on
(tag, hash)
Then try this query:
and after adding another index:
try this variation (explanation below, at point 3):
Other things you should consider:
URLS (md5)
, one Primary and one Unique. You don't need both, you can safely delete the redundant (unique) index.Why was
md5
was chosen as the primary key in the first place? Why not a surrogate (auto-incrementing) integer column? That is usually a better choice for primary key in InnODB tables because the primary key is also by default the clustered index of the table.If you make that change (add an
url_id
column and make it PK), you will also need to alter theTAGS
table by adding aurl_id
as well and removing thehash
column. This would help efficiency, too. Indexes onint
columns take much less space than indexes onVARCHAR(255)
columns. 4 bytes vs. 757 bytes wide is a lot of difference.Read this blog post about text and blob columns in InnoDB: Handling long texts/blobs in InnoDB If the issue turns out to be the text columns, you could split the table into two tables, having an 1-1 relationship between them, so text columns are separately from the narrow columns (or try the index I suggested, which is a similar fix for this.)