Mysql – How to store static list of hashes in MySQL effectively

MySQLmysql-5.5

Problem: I have a list of pairs (md5_hash, id). I want to store the data in MySQL and I want to make queries like this

SELECT id FROM table WHERE md5_hash = <some_hash>

The number of pairs can be tens or hundred of millions and the number is static – i.e., I do not add new records. It is important for me to save disk space. The time is not so important in this case (if a lookup takes less than say 1 second it is okey).

My thoughts:

I started by creating a table where md5_hash is represented in this way:

CREATE TABLE `myTable` (
   `md5` binary(16) NOT NULL, ' the values are not unique but we can suppose they are.
   `id` int(10) unsigned NOT NULL
) ENGINE=MyISAM

so md5 hashes are represented as binary numbers instead of hexadecimal characters to save space. For the same reasons the columns are defined as "NOT NULL". I chose MyISAM engine because my tests show that MyISAM requires less disk space than InnoDB. In addition MyISAM tables can be compressed by myisampack utility.

Now the hard part comes. When I create an index on md5 column the index requires more disk space than the data itself! I tested it with 17 millions records and my table took around 300 MB of disk space and the index took about 330 MB of disk space. The size of the index is crazy.

One idea is to provide the md5 hashes presorted to MySQL which should result in smaller index size. But I do not know how to do that.

Another idea is to divide myTable in several smaller tables in order to decrease size of the index. I tried mysql partitioning. However, the purpose of the feature is to improve query time and not disk space usage.

Do you have any idea how to decrease the required disk space usage?

Best Answer

Some options:

Use InnoDB. If not for these reasons, then for compression! With InnoDB you can:
```
ALTER TABLE my_table ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```
Replace 8 with 4 or 2 or 1 to (hopefully) get better compression. Since the table is static, I think this is a great solution for you (with strong compression, writes become slower, but you don't care about that).

Not only data is compressed - indexes also. I would suggest this is the easiest option you have.
Only index part of your column. You agree that the column is not UNIQUE but can be assumed to be. OK; is your index a UNIQUE index? If not, try to:
```
ALTER TABLE my_table ADD INDEX `md5_prefix_idx` (md5(8))
```
to only index first 8 bytes of the columns.
Try TokuDB, which is an alternative storage engine to InnoDB, and which has an amazing compression (I've seen data get 20 times smaller than InnoDB with TokuDB aggressive compression).

What about the Removal of Columns ?

You removed geo, which were all NULL. There was no shrinkage for the data. Why ? Since all geo columns were NULL, it took no space at all in every row. The mapping of offsets for all fields had geo pointing at nothing at all.

In the MySQL Documentation on FIELD CONTENTS in InnoDB, it has the following example table

CREATE TABLE T
    (FIELD1 VARCHAR(3), FIELD2 VARCHAR(3), FIELD3 VARCHAR(3))
    Type=InnoDB;

Note that both FIELD1 and FIELD3 are VARCHAR fields with NULL values.

From the Helpful Notes About NULL values at the Bottom of that Page

For the third row, I inserted NULLs in FIELD2 and FIELD3. Therefore in the Field Start Offsets the top bit is on for these fields (the values are 94 hexadecimal, 94 hexadecimal, instead of 14 hexadecimal, 14 hexadecimal). And the row is shorter because the NULLs take no space.

Thus, removing a VARCHAR column where every value was NULL will not shrink the table.

What did adding FLOAT columns actually do ?

Let's play a little math game. You said the new table has 106.7M rows.

Each FLOAT column occupies 4 bytes
You added 2 FLOATs, so that is 8 bytes
The amount of space reserved for a FLOAT is still fixed at 4 bytes, even for a NULL FLOAT.
That 853.6M bytes which is 814 MiB.

At first glance, you would think that Data should have increased from 20.8 GiB to 21.6 GiB. What should have been 4% increase, increasing by 814MB, grew to 37.3 GiB, which is a 79.3% increase in Data.

Why Did the Data Grow So Much ?

Please be aware that there is an object in the table called gen_clust_index, a.k.a. the table's Clustered Index. This is the table's PRIMARY KEY. You may find this shocking, but the PRIMARY KEY does not reside in the table's index pages. It resides in the table's Data Pages.

Question: What is the size of a page in InnoDB ? Answer: 16KB.

What is inside the 16KB of a Data Page ?

Data for one or more rows
BTREE information for the PRIMARY KEY
Leaf Nodes that point to a row are in the same page

Remember you added two FLOATs ? This makes each row 8 bytes bigger. Each row will expand in the 16KB of the Data Page, leaving less room for BTREE information. What must InnoDB do to maintain the PRIMARY KEY ? Create additional Data pages to accommodate wider rows along with the associated BTREE information.

Since the PRIMARY KEY is a BTREE, you should expect O(n log n) growth in BTREE nodes, along with an overflow of InnoDB Data pages.

Without doing any further math, I will show you how this is the case.

MyISAM has data and indexes stored separately.

Run the following

CREATE TABLE twitter_tweets_myisam LIKE twitter_tweets_new;
ALTER TABLE twitter_tweets_myisam ENGINE=MyISAM;
ALTER TABLE twitter_tweets_myisam DISABLE KEYS;
INSERT INTO twitter_tweets_myisam SELECT * FROM twitter_tweets_new;
ALTER TABLE twitter_tweets_myisam ENABLE KEYS;

I can assure you that the .MYD file of twitter_tweets_myisam will not be 37.3 GiB

Q.E.D.

Mysql – How to update 10 million+ rows in MySQL single table as Fast as possible

Based on my experience, I would use LOAD DATA INFILE to import your CSV File.

The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed.

Example I found on the internet Load Data example. I tested this example on my box and worked fine

Example Table

CREATE TABLE example (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `Column2` varchar(14) NOT NULL,
  `Column3` varchar(14) NOT NULL,
  `Column4` varchar(14) NOT NULL,
  `Column5` DATE NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB

Example CSV File

# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013

Import Statement to be run from MySQL console

LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY '\n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');

Result

MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5    |
+----+---------+---------+---------+------------+
|  1 |         | Column2 | Column3 | 0000-00-00 |
|  2 |         | B       | Bar     | 0000-00-00 |
|  3 |         | C       | Foo     | 0000-00-00 |
|  4 |         | D       | Bar     | 0000-00-00 |
|  5 |         | E       | FOObar  | 0000-00-00 |
+----+---------+---------+---------+------------+

IGNORE just simply ignores the first line which are column headers.

After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.

Here is another example directly from Oracle : LOAD DATA INFILE example

This should be enough to get you started.

Best Answer

Related Solutions

MySQL – Indexes deleted, reduced column size but table size increases

What about the Removal of Columns ?

What did adding FLOAT columns actually do ?

Why Did the Data Grow So Much ?

Q.E.D.

Mysql – How to update 10 million+ rows in MySQL single table as Fast as possible

Related Question