MySQL – How to store a static list of hashes in MySQL efficiently

Tags: mysql, mysql-5.5

Problem: I have a list of pairs (md5_hash, id). I want to store the data in MySQL and run queries like this:

SELECT id FROM table WHERE md5_hash = <some_hash>

The number of pairs can be tens or hundreds of millions, and the number is static, i.e., I do not add new records. It is important for me to save disk space. Query time is not so important in this case (if a lookup takes less than, say, 1 second, that is okay).

My thoughts:

I started by creating a table where md5_hash is represented in this way:

CREATE TABLE `myTable` (
   `md5` binary(16) NOT NULL,  -- the values are not unique, but we can suppose they are
   `id` int(10) unsigned NOT NULL
) ENGINE=MyISAM;

So the md5 hashes are stored as 16 raw bytes instead of 32 hexadecimal characters to save space. For the same reason the columns are defined as "NOT NULL". I chose the MyISAM engine because my tests show that MyISAM requires less disk space than InnoDB. In addition, MyISAM tables can be compressed with the myisampack utility.
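For illustration, this is roughly how a hex digest goes in and out of such a column. The hash and id below are made-up sample values (the digest is the MD5 of an empty string), not rows from my data:

```sql
-- Store a 32-character hex digest as 16 raw bytes (sample values).
INSERT INTO myTable (md5, id)
VALUES (UNHEX('d41d8cd98f00b204e9800998ecf8427e'), 1);

-- Look up by hash: the literal is converted to binary, so the column
-- itself is compared without wrapping it in a function.
SELECT id
FROM myTable
WHERE md5 = UNHEX('d41d8cd98f00b204e9800998ecf8427e');

-- Read a hash back in hexadecimal form.
SELECT LOWER(HEX(md5)) AS md5_hex, id
FROM myTable
LIMIT 1;
```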

Now the hard part comes. When I create an index on the md5 column, the index requires more disk space than the data itself! I tested it with 17 million records: the table took around 300 MB of disk space and the index took about 330 MB. The size of the index is crazy.
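One way to take such a measurement, sketched here under the assumption that the table lives in the currently selected database, is to read the sizes from information_schema (the column aliases are only illustrative):

```sql
-- Compare data size and index size for one table, in megabytes.
SELECT table_name,
       engine,
       table_rows,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'myTable';
```

For MyISAM the row count is exact; for InnoDB it is only an estimate, so the sizes are the more reliable figures when comparing engines.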

One idea is to provide the md5 hashes presorted to MySQL which should result in smaller index size. But I do not know how to do that.

Another idea is to divide myTable into several smaller tables in order to decrease the size of the index. I tried MySQL partitioning. However, the purpose of that feature is to improve query time, not disk space usage.

Do you have any idea how to decrease the required disk space usage?

Best Answer

Some options:

  • Use InnoDB. If not for its other advantages, then for compression! With InnoDB you can:

    ALTER TABLE my_table ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
    

    Replace 8 with 4 or 2 or 1 to (hopefully) get better compression. Since the table is static, I think this is a great solution for you (with strong compression, writes become slower, but you don't care about that).

    Not only is the data compressed; the indexes are compressed as well. I would suggest this is the easiest option you have.

  • Only index part of your column. You agree that the column is not UNIQUE but can be assumed to be. OK; is your index a UNIQUE index? If not, try to:

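    ALTER TABLE my_table ADD INDEX `md5_prefix_idx` (md5(8));
    

    to index only the first 8 bytes of the column. Lookups on the full hash still return correct results, because MySQL uses the prefix to find candidate rows and then checks the complete 16-byte value.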

  • Try TokuDB, which is an alternative storage engine to InnoDB and which has amazing compression (I've seen data get 20 times smaller than with InnoDB when using TokuDB's aggressive compression).
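
A rough sketch of how the first two options might be applied and checked on MySQL 5.5. It assumes you have the SUPER privilege to change global settings, and the hash literal is just a sample value. On 5.5, ROW_FORMAT=COMPRESSED only takes effect with file-per-table tablespaces and the Barracuda file format. Later versions changed or removed these settings, so treat the first two statements as version-specific.

```sql
-- Option 1: compressed InnoDB.
-- Without these two settings, MySQL 5.5 (with strict mode off) ignores the
-- compression request and only raises warnings.
SET GLOBAL innodb_file_per_table = ON;
SET GLOBAL innodb_file_format = 'Barracuda';

ALTER TABLE my_table ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
SHOW WARNINGS;  -- look for messages saying the row format was ignored

-- Option 2: check how selective an 8-byte prefix is before relying on it.
-- A ratio close to 1 means a prefix almost always identifies a single row.
SELECT COUNT(DISTINCT LEFT(md5, 8)) / COUNT(*) AS prefix_selectivity
FROM my_table;

-- Confirm that a full-hash lookup actually uses the prefix index.
EXPLAIN
SELECT id
FROM my_table
WHERE md5 = UNHEX('d41d8cd98f00b204e9800998ecf8427e');
```

The selectivity query scans the whole table, which is acceptable for a one-off check on static data. If the ratio is close to 1, the shorter index costs almost nothing in lookup quality.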