Thesql: need help to optimize the query/table

indexMySQLoptimization

I'm wondering if someone could help me optimize my tables/query to speed up a query. It is currently running ridiculously slow. I think a well-thought out index could help me. Any help would be really appreciated

Tables URLS and TAGS mentioned below are 2 and 20 million rows respectively (will probably end up having 10x). A query like the one below already takes 10 seconds to run.

An Example:
http://whatrethebest.com/php+tutorials

Tables

CREATE TABLE IF NOT EXISTS `TAGS` (
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `my_unique_key` (`hash`,`tag`),
KEY `tag` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

and

CREATE TABLE IF NOT EXISTS `URLS` (
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`md5`),
UNIQUE KEY `md5` (`md5`),
KEY `numsaves` (`numsaves`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

QUERY

SELECT urls.md5, urls.url, urls.title, urls.numsaves
FROM urls
JOIN tags ON urls.md5 = tags.hash
WHERE tags.tag
IN (
'php', 'tutorials'
)
GROUP BY urls.md5
HAVING COUNT( * ) =2
ORDER BY urls.numsaves DESC
LIMIT 20

EXPLAIN

I'm not sure what this shows

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  tags    range   my_unique_key,tag   tag     767     NULL    230946  Using where; Using index; Using temporary; Using filesort
1   SIMPLE  urls    eq_ref  PRIMARY,md5     PRIMARY     767     jcooper_whatrethebest_urls.tags.hash    1

So I think the problem is:

certain tags like 'php have 34,000 entries, most of which only have under 5 saves. But in order to get the 20 most saved it is having to sort them all.Right?

I can't really create a 'numsaves' column in TAGS and index on that because that number will be changing up and down, and that wouldnt make sense. Is it possible to create a cross-table index between urls.numsaves and tags.tag? Or a third table to use in my query somehow? Would this solve my problem? I know almost nothing about indexing.

Any help would be really appreciated!

EDITS BELOW

RESPONSE TO YperCube:

Thank you, Thank you, your suggestions have sped up my queries by a factor of 10-20X .
This is an immense improvement. I can't thank you enough.

I'm posting my current queries and tables with execution times in case you or anyone else has any more optimization suggestions. I am worried that as my table grows I may not be able to keep my search times under 3 seconds, which would be a killer.

New Query Example 1

SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = u.id
AND t2.tag = 'language'
ORDER BY u.numsaves DESC
LIMIT 20 

Showing rows 20 - 19 ( 20 total, Query took 0.2009 sec) 

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  t2  ref     tag_id  tag_id  767     const   53820   Using where; Using index; Using temporary; Using filesort
1   SIMPLE  t1  ref     tag_id  tag_id  772     const,jcooper_whatrethebest_urls.t2.url_id  1   Using where; Using index
1   SIMPLE  u   eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   jcooper_whatrethebest_urls.t2.url_id    1

Neq Query Example 2 (seems to be slower)

SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN 
( SELECT ui.id, ui.numsaves
FROM urls AS ui
JOIN tags AS t1 ON  t1.url_id = ui.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON  t2.url_id = ui.id
AND t2.tag = 'language'
ORDER BY ui.numsaves DESC
LIMIT 20
) AS ulim ON ulim.id = u.id
ORDER BY ulim.numsaves DESC ;

Showing rows 0 - 29 ( 2,794,577 total, Query took 0.4633 sec)

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   PRIMARY     <derived2>  ALL     NULL    NULL    NULL    NULL    20  Using filesort
1   PRIMARY     u   eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   ulim.id     1   
2   DERIVED     t2  ref     tag_id  tag_id  767         53820   Using where; Using index; Using temporary; Using filesort
2   DERIVED     t1  ref     tag_id  tag_id  772     jcooper_whatrethebest_urls.t2.url_id    1   Using where; Using index
2   DERIVED     ui  eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   jcooper_whatrethebest_urls.t2.url_id    1

Using Query Example on a Single Tag (slower by a lot)

SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
ORDER BY u.numsaves DESC
LIMIT 20 

Showing rows 20 - 19 ( 20 total, Query took 3.7395 sec)

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  t1  ref     tag_id  tag_id  767     const   200576  Using where; Using index; Using temporary; Using filesort
1   SIMPLE  u   eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   jcooper_whatrethebest_urls.t1.url_id    1

I'm not sure why this one is so much slower?

Do you have any ideas of a query to optimize for querying a single tag?

My Current Tables

CREATE TABLE `urls` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
KEY `id_numsaves_IX` (`id`,`numsaves`)
) ENGINE=InnoDB AUTO_INCREMENT=2958560 DEFAULT CHARSET=utf8

CREATE TABLE `tags` (
`url_id` int(11) DEFAULT NULL,
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `tag_id` (`tag`,`url_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Thank you again

Best Answer

Add an index on (tag, hash)

ALTER TABLE tags
  DROP INDEX tag,
  ADD UNIQUE INDEX tag_hash_UX
    (tag, `hash`) ;

Then try this query:

SELECT u.md5, u.url, u.title, u.numsaves
FROM urls AS u
  JOIN tags AS t1 ON  t1.hash = u.md5
                  AND t1.tag = 'php'
  JOIN tags AS t2 ON  t2.hash = u.md5
                  AND t2.tag = 'tutorials'
ORDER BY u.numsaves DESC
  LIMIT 20 ;

and after adding another index:

ALTER TABLE urls
  DROP INDEX md5,
  ADD INDEX md5_numsaves_IX
    (md5, numsaves) ;

try this variation (explanation below, at point 3):

SELECT u.md5, u.url, u.title, u.numsaves
FROM urls AS u
  JOIN 
    ( SELECT ui.md5, ui.numsaves
      FROM urls AS ui
        JOIN tags AS t1 ON  t1.hash = ui.md5
                  AND t1.tag = 'php'
        JOIN tags AS t2 ON  t2.hash = ui.md5
                  AND t2.tag = 'tutorials'
      ORDER BY ui.numsaves DESC
        LIMIT 20
    ) AS ulim ON ulim.md5 = u.md5
ORDER BY ulim.numsaves DESC ;

Other things you should consider:

You have 2 identical indexes on URLS (md5), one Primary and one Unique. You don't need both, you can safely delete the redundant (unique) index.
Why was md5 was chosen as the primary key in the first place? Why not a surrogate (auto-incrementing) integer column? That is usually a better choice for primary key in InnODB tables because the primary key is also by default the clustered index of the table.

If you make that change (add an url_id column and make it PK), you will also need to alter the TAGS table by adding a url_id as well and removing the hash column. This would help efficiency, too. Indexes on int columns take much less space than indexes on VARCHAR(255) columns. 4 bytes vs. 757 bytes wide is a lot of difference.
Read this blog post about text and blob columns in InnoDB: Handling long texts/blobs in InnoDB If the issue turns out to be the text columns, you could split the table into two tables, having an 1-1 relationship between them, so text columns are separately from the narrow columns (or try the index I suggested, which is a similar fix for this.)

Best Answer

Related Solutions

Mysql – High Disk IO, How to mitigate

Mysql – Help optimizing MySQL slow query

Related Question