Thesql: need help to optimize the query/table

indexMySQLoptimization

I'm wondering if someone could help me optimize my tables/query to speed up a query. It is currently running ridiculously slow. I think a well-thought out index could help me. Any help would be really appreciated

Tables URLS and TAGS mentioned below are 2 and 20 million rows respectively (will probably end up having 10x). A query like the one below already takes 10 seconds to run.

An Example:
http://whatrethebest.com/php+tutorials

Tables

CREATE TABLE IF NOT EXISTS `TAGS` (
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `my_unique_key` (`hash`,`tag`),
KEY `tag` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

and

CREATE TABLE IF NOT EXISTS `URLS` (
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`md5`),
UNIQUE KEY `md5` (`md5`),
KEY `numsaves` (`numsaves`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

QUERY

SELECT urls.md5, urls.url, urls.title, urls.numsaves
FROM urls
JOIN tags ON urls.md5 = tags.hash
WHERE tags.tag
IN (
'php', 'tutorials'
)
GROUP BY urls.md5
HAVING COUNT( * ) =2
ORDER BY urls.numsaves DESC
LIMIT 20

EXPLAIN

I'm not sure what this shows

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  tags    range   my_unique_key,tag   tag     767     NULL    230946  Using where; Using index; Using temporary; Using filesort
1   SIMPLE  urls    eq_ref  PRIMARY,md5     PRIMARY     767     jcooper_whatrethebest_urls.tags.hash    1   

So I think the problem is:

certain tags like 'php have 34,000 entries, most of which only have under 5 saves. But in order to get the 20 most saved it is having to sort them all.Right?

I can't really create a 'numsaves' column in TAGS and index on that because that number will be changing up and down, and that wouldnt make sense. Is it possible to create a cross-table index between urls.numsaves and tags.tag? Or a third table to use in my query somehow? Would this solve my problem? I know almost nothing about indexing.

Any help would be really appreciated!


EDITS BELOW

RESPONSE TO YperCube:

Thank you, Thank you, your suggestions have sped up my queries by a factor of 10-20X .
This is an immense improvement. I can't thank you enough.

I'm posting my current queries and tables with execution times in case you or anyone else has any more optimization suggestions. I am worried that as my table grows I may not be able to keep my search times under 3 seconds, which would be a killer.

New Query Example 1

SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = u.id
AND t2.tag = 'language'
ORDER BY u.numsaves DESC
LIMIT 20 

Showing rows 20 - 19 ( 20 total, Query took 0.2009 sec) 

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  t2  ref     tag_id  tag_id  767     const   53820   Using where; Using index; Using temporary; Using filesort
1   SIMPLE  t1  ref     tag_id  tag_id  772     const,jcooper_whatrethebest_urls.t2.url_id  1   Using where; Using index
1   SIMPLE  u   eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   jcooper_whatrethebest_urls.t2.url_id    1   

Neq Query Example 2 (seems to be slower)

SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN 
( SELECT ui.id, ui.numsaves
FROM urls AS ui
JOIN tags AS t1 ON  t1.url_id = ui.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON  t2.url_id = ui.id
AND t2.tag = 'language'
ORDER BY ui.numsaves DESC
LIMIT 20
) AS ulim ON ulim.id = u.id
ORDER BY ulim.numsaves DESC ;

Showing rows 0 - 29 ( 2,794,577 total, Query took 0.4633 sec)

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   PRIMARY     <derived2>  ALL     NULL    NULL    NULL    NULL    20  Using filesort
1   PRIMARY     u   eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   ulim.id     1   
2   DERIVED     t2  ref     tag_id  tag_id  767         53820   Using where; Using index; Using temporary; Using filesort
2   DERIVED     t1  ref     tag_id  tag_id  772     jcooper_whatrethebest_urls.t2.url_id    1   Using where; Using index
2   DERIVED     ui  eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   jcooper_whatrethebest_urls.t2.url_id    1   

Using Query Example on a Single Tag (slower by a lot)

SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
ORDER BY u.numsaves DESC
LIMIT 20 

Showing rows 20 - 19 ( 20 total, Query took 3.7395 sec)

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  t1  ref     tag_id  tag_id  767     const   200576  Using where; Using index; Using temporary; Using filesort
1   SIMPLE  u   eq_ref  PRIMARY,id_numsaves_IX  PRIMARY     4   jcooper_whatrethebest_urls.t1.url_id    1   

I'm not sure why this one is so much slower?

Do you have any ideas of a query to optimize for querying a single tag?

My Current Tables

CREATE TABLE `urls` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
KEY `id_numsaves_IX` (`id`,`numsaves`)
) ENGINE=InnoDB AUTO_INCREMENT=2958560 DEFAULT CHARSET=utf8

CREATE TABLE `tags` (
`url_id` int(11) DEFAULT NULL,
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `tag_id` (`tag`,`url_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Thank you again

Best Answer

Add an index on (tag, hash)

ALTER TABLE tags
  DROP INDEX tag,
  ADD UNIQUE INDEX tag_hash_UX
    (tag, `hash`) ;

Then try this query:

SELECT u.md5, u.url, u.title, u.numsaves
FROM urls AS u
  JOIN tags AS t1 ON  t1.hash = u.md5
                  AND t1.tag = 'php'
  JOIN tags AS t2 ON  t2.hash = u.md5
                  AND t2.tag = 'tutorials'
ORDER BY u.numsaves DESC
  LIMIT 20 ;

and after adding another index:

ALTER TABLE urls
  DROP INDEX md5,
  ADD INDEX md5_numsaves_IX
    (md5, numsaves) ;

try this variation (explanation below, at point 3):

SELECT u.md5, u.url, u.title, u.numsaves
FROM urls AS u
  JOIN 
    ( SELECT ui.md5, ui.numsaves
      FROM urls AS ui
        JOIN tags AS t1 ON  t1.hash = ui.md5
                  AND t1.tag = 'php'
        JOIN tags AS t2 ON  t2.hash = ui.md5
                  AND t2.tag = 'tutorials'
      ORDER BY ui.numsaves DESC
        LIMIT 20
    ) AS ulim ON ulim.md5 = u.md5
ORDER BY ulim.numsaves DESC ;

Other things you should consider:

  • You have 2 identical indexes on URLS (md5), one Primary and one Unique. You don't need both, you can safely delete the redundant (unique) index.
  • Why was md5 was chosen as the primary key in the first place? Why not a surrogate (auto-incrementing) integer column? That is usually a better choice for primary key in InnODB tables because the primary key is also by default the clustered index of the table.

    If you make that change (add an url_id column and make it PK), you will also need to alter the TAGS table by adding a url_id as well and removing the hash column. This would help efficiency, too. Indexes on int columns take much less space than indexes on VARCHAR(255) columns. 4 bytes vs. 757 bytes wide is a lot of difference.

  • Read this blog post about text and blob columns in InnoDB: Handling long texts/blobs in InnoDB If the issue turns out to be the text columns, you could split the table into two tables, having an 1-1 relationship between them, so text columns are separately from the narrow columns (or try the index I suggested, which is a similar fix for this.)