I would like to store some user related data into a table. The data is identified by MD5 hash.
I have 2 options:
option 1 –
have a table with 2 columns: hash_id (unsigned bigint)
and md5 (char(32))
. Store the actual data in another table where the key is hash_id (unsigned bigint) + user_id (bigint)
.
option 2 –
Store the data in a table where the key is md5 (char(32)) + user_id (bigint)
.
Option 1 looks more "normalized", but I prefer option 2 so I wanted to know if using it will cause a major performance penalty.
In order to check, I defined both tables and tried the following operations:
- Insert 100K entries to an empty table
- Use 10K ids/MD5s that already exist in the table and for each of them run a select query and an update query (to imitate scenario in which I want to set data in an entry that already exist so I need to update it).
- Use 10K ids/MD5s that don't exist in the table and for each of them run a select query and an insert query (to imitate scenario in which I want to set data in an entry that doesn't exist so I need to create it).
- Delete 10K existing entries
In all tests besides the last one, the results for the second table (the one with the actual MD5) were much better than results for the first (the one with the hash_id).
How is it possible? I assumed that using a char key, which is longer – 32 bytes, would be worse than using a bigint – 8 bytes. What am I missing here?
I use MySQL + InnoDB.
Thanks.
Best Answer
The size and makeup of an index, composite or not, does not make a lot of difference to performance.
However, the randomness of the accesses does. This combination is a performance killer:
Let's say the index (or table) is 4 times as big as the buffer pool. Then the 'next' lookup (either for
SELECT
or for adding a row, hence updating the index), has a 3/4th chance of being a disk miss. That is 75% of such operations will incur a disk hit. If you are using an HDD, that is (a Rule of Thumb) 10ms. (SSDs are faster, but still not free.You can mitigate the problem by shrinking the table size. This can be done by shrinking datatypes. Don't use
INT
(4 bytes) for a Yes/No flag, useTINYINT
(1-byte). Don't store the MD5 asCHAR(32)
(32 bytes -- or 96 ifutf8
); useHEX()
andUNHEX()
when fetching and storing into aBINARY(16)
column.Most other types of indexes have some degree of "locality of reference". That is, two (or a thousand) consecutive references will be "near" each other, hence the caching of the buffer_pool can help avoid I/O, even if the table/index is much bigger than the buffer_pool.
More discussion on why UUIDs suck, plus a partial remedy for Type-1 UUIDs, which are time-based and it is the type that MySQL uses.
As for your benchmark...
MD5
would be jumping all around, but theBIGINT
would be hitting more cacheable stuff.