Mysql – How to better optimize the tables and a LEFT JOIN query to find items that don’t yet exist in the right table

database-designMySQLoptimization

I have two tables that represent a list of urls and their related word indexes. Here are the table definitions for reference.

desc urllist;
+-------+---------------------+------+-----+---------+----------------+
| Field | Type                | Null | Key | Default | Extra          |
+-------+---------------------+------+-----+---------+----------------+
| id    | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| url   | text                | NO   |     | NULL    |                |
+-------+---------------------+------+-----+---------+----------------+

and

desc wordlocation;
+----------+---------------------+------+-----+---------+-------+
| Field    | Type                | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+-------+
| urlid    | bigint(20) unsigned | NO   |     | NULL    |       |
| wordid   | bigint(20) unsigned | NO   |     | NULL    |       |
| location | int(10) unsigned    | NO   |     | NULL    |       |
+----------+---------------------+------+-----+---------+-------+

The software app is a web spider. It crawls a list of urls, extracts those urls, and inserts these into the urllist table. Then, an indexer checks to see what urls have not yet been indexed, and then proceeds to index said urls.

Here is the query I am using to find items in the left table (urllist) that have not yet been indexed in the right table (wordlocation). This query is as suggested on the mysql.com website:

select * from urllist ul 
left join wordlocation wl on ul.id = wl.urlid 
where wl.urlid IS NULL;

As of this writing my test database has only 600 indexed urls, and the wordlocation table has 1.3 million rows. However, my CPU is at 100%, and the longest I've waited to see if the query would complete is a half hour (which, it never did by the way).

To be thorough, here is the explanation of the query:

explain select * from urllist ul left join wordlocation wl on ul.id = wl.urlid where wl.urlid IS NULL;
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                   |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
|  1 | SIMPLE      | ul    | ALL  | NULL          | NULL | NULL    | NULL |   50364 |                         |
|  1 | SIMPLE      | wl    | ALL  | NULL          | NULL | NULL    | NULL | 1351371 | Using where; Not exists |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+

I need this query to finish in seconds, not minutes. Also, I'm worried about scalability. I have 40,000 unique urls that are waiting to be added to the index, so how do I take that into consideration with my table and query design? 400,000 urls?

Just a couple notes about my decisions on the current table structure.

I have no intention of stopping at 400,000 urls, but perhaps bigint(20) is a bit overzealous?

Url as text is for more practical reasons. I index a lot of Asian and other foreign language domains which do not appear as their equivalent Kanji or other characters in the database and frequently take more than 255 characters.

I'm using MySQL. I'm definitely open to suggestions as to better table and query design. Please let me know if I can provide more information.

Best Answer

To help with your immediate problem, adding an index on wordlocation.urlid will help your query a lot.

For scalability, it sounds like your best route might be to add a field to urllist that can be easily referenced to see which urls have been indexed. For example a tinyint column called indexed with a default of zero (so new entries are always zero). Then when you index a url, update this column to a one.