MySQL – InnoDB, Hash Table, Reference Table and Fragmentation

Tags: fragmentation, innodb, mysql, partitioning

I have a big list of domains and URLs that I store in a database: around 150M domains and 300M URLs. I'm using InnoDB and store each in the format:

CREATE TABLE IF NOT EXISTS `domains_list` (
  `id` int(10) unsigned NOT NULL DEFAULT '0',
  `domain` varchar(350) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

This works fine: new records are assigned what is effectively an auto-increment value, so there's no fragmentation and inserts arrive in ascending key order.

When new data comes in, however (typically between 250K and 2M rows), I use two separate 'hash' tables to check whether each domain or URL already exists in the database. Strictly speaking it's not a hash table, just a table of MD5s that I use to ensure values are unique, with the added benefit that rows are fixed length. The table is also partitioned.

CREATE TABLE IF NOT EXISTS `domains_hashes` (
  `id` int(10) unsigned NOT NULL,
  `segment` tinyint(3) unsigned NOT NULL,
  `hash` binary(15) NOT NULL,
  PRIMARY KEY (`segment`,`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (segment)
(PARTITION p0 VALUES LESS THAN (1) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (2) ENGINE = InnoDB,
 ...
 PARTITION p254 VALUES LESS THAN (255) ENGINE = InnoDB,
 PARTITION p255 VALUES LESS THAN (256) ENGINE = InnoDB) */;

segment is basically the first byte of the MD5, which is used to choose the partition; the remaining 15 bytes go into hash.
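
As a concrete illustration, a row could be derived from an MD5 roughly like this; the domain and id literals are placeholders, and the exact derivation used in the original setup may differ:

INSERT INTO domains_hashes (id, segment, hash)
VALUES (
  42,                                          -- id of the matching domains_list row (placeholder)
  CONV(LEFT(MD5('example.com'), 2), 16, 10),   -- first byte of the MD5, as 0-255
  UNHEX(SUBSTRING(MD5('example.com'), 3))      -- remaining 15 bytes
);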

For checking whether a batch of domains already exists in the database this works relatively well; however, the table becomes fragmented because the hashes insert in essentially random order.

The hash table is used only for insertions and for quickly looking up whether a domain already exists in the DB. During insertions, a script walks segments 0-255 and performs the necessary check for each.
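
A single step of that walk might look like the following sketch; the hex literals are placeholder 15-byte hashes, not real values:

SELECT HEX(hash)
FROM domains_hashes
WHERE segment = 0                                        -- current step of the 0-255 walk
  AND hash IN (UNHEX('00112233445566778899aabbccddee'),
               UNHEX('ffeeddccbbaa998877665544332211'));

The equality on segment lets MySQL prune the scan to a single partition.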

My question is: do you know of a better procedure for handling these inserts and lookups? I believe when I started out with this database I simply had a key on domains_list.domain, which was slow.

I find that right after the partitions are re-organised the lookups are very quick, but after a number of batch insertions the same lookups slow down somewhat. The server has 32GB of RAM, of which 16GB goes to the buffer pool, while the table itself takes up 5.4GB on disk.
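
(Re-organising here means a per-partition rebuild along the lines of the statement below, which rewrites a partition's rows in primary-key order and removes the fragmentation.)

ALTER TABLE domains_hashes REBUILD PARTITION p0, p1;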

Best Answer

Several ideas come to mind:

The hash table could be a MEMORY table; once your inserts are done, you could then copy just the domains that don't already exist into the permanent domains_list table. Having your new data in memory might also make the domain-name comparison between the hash table and the domains_list table go faster.
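
A minimal sketch of that idea, assuming a staging table called new_domains and an id counter carried in a user variable (both names are illustrative, not part of the original setup):

SET @id := (SELECT COALESCE(MAX(id), 0) FROM domains_list);

CREATE TABLE new_domains (
  domain  VARCHAR(350) NOT NULL,
  segment TINYINT UNSIGNED NOT NULL,
  hash    BINARY(15) NOT NULL,
  PRIMARY KEY (segment, hash)
) ENGINE=MEMORY;

-- ... bulk-load the incoming batch into new_domains ...

-- copy only the rows whose hash is not already known
INSERT INTO domains_list (id, domain)
SELECT @id := @id + 1, n.domain
FROM new_domains n
LEFT JOIN domains_hashes h
  ON h.segment = n.segment AND h.hash = n.hash
WHERE h.hash IS NULL;

The new hashes would then be appended to domains_hashes in the same way.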

Another option would be to use the innodb_file_per_table option and keep the per-table (.ibd) files on a faster disk.
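
Assuming a MySQL version where the setting is dynamic (otherwise it goes in my.cnf under [mysqld]), that looks like:

SET GLOBAL innodb_file_per_table = 1;

-- existing tables only move into their own .ibd files after a rebuild
ALTER TABLE domains_hashes ENGINE=InnoDB;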

A third option would be to load your entire dataset into a MEMORY table and then convert that table to InnoDB using ALTER TABLE table_name ENGINE=InnoDB, as sketched below. This would be much faster than inserting into InnoDB directly, and it's possible since you control how and when data is inserted.
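
A sketch of that load-then-convert approach, with a hypothetical staging table:

CREATE TABLE domains_list_mem (
  `id` int(10) unsigned NOT NULL,
  `domain` varchar(350) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MEMORY;

-- ... bulk-load the dataset into domains_list_mem ...

-- one-shot conversion; InnoDB rebuilds the table as a clustered index in key order
ALTER TABLE domains_list_mem ENGINE=InnoDB;

Note that a MEMORY table is capped by max_heap_table_size, so that variable would need to be raised to hold a dataset of this size.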