MySQL – Performance of Char+Bigint Key vs Bigint+Bigint Key

mysql performance primary-key

I would like to store some user-related data in a table. The data is identified by an MD5 hash.

I have 2 options:

Option 1 –
Have a table with 2 columns: hash_id (unsigned bigint) and md5 (char(32)). Store the actual data in another table where the key is hash_id (unsigned bigint) + user_id (bigint).

Option 2 –
Store the data in a table where the key is md5 (char(32)) + user_id (bigint).

Option 1 looks more "normalized", but I prefer option 2, so I wanted to know whether using it will cause a major performance penalty.
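Roughly, the two layouts look like this (table names and the data column are just placeholders for the real thing):

    -- Option 1: surrogate bigint key, MD5 kept in a lookup table
    CREATE TABLE md5_lookup (
        hash_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        md5      CHAR(32) NOT NULL,
        PRIMARY KEY (hash_id),
        UNIQUE KEY (md5)
    ) ENGINE=InnoDB;

    CREATE TABLE user_data_1 (
        hash_id  BIGINT UNSIGNED NOT NULL,
        user_id  BIGINT NOT NULL,
        data     VARCHAR(255),              -- placeholder for the real payload
        PRIMARY KEY (hash_id, user_id)
    ) ENGINE=InnoDB;

    -- Option 2: the MD5 itself is part of the key
    CREATE TABLE user_data_2 (
        md5      CHAR(32) NOT NULL,
        user_id  BIGINT NOT NULL,
        data     VARCHAR(255),              -- placeholder for the real payload
        PRIMARY KEY (md5, user_id)
    ) ENGINE=InnoDB;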

In order to check, I defined both tables and tried the following operations:

  1. Insert 100K entries to an empty table
  2. Use 10K ids/MD5s that already exist in the table and, for each of them, run a select query and an update query (to imitate a scenario in which I want to set data in an entry that already exists, so I need to update it); the statements are sketched after this list.
  3. Use 10K ids/MD5s that don't exist in the table and, for each of them, run a select query and an insert query (to imitate a scenario in which I want to set data in an entry that doesn't exist, so I need to create it).
  4. Delete 10K existing entries
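For reference, the per-entry statements were along these lines (shown against the option-2 layout sketched above; the option-1 version uses hash_id instead of md5, plus a lookup against md5_lookup):

    -- Step 2: the entry exists, so read it and then update it
    SELECT data FROM user_data_2 WHERE md5 = ? AND user_id = ?;
    UPDATE user_data_2 SET data = ? WHERE md5 = ? AND user_id = ?;

    -- Step 3: the entry does not exist, so the SELECT comes back empty and we insert
    SELECT data FROM user_data_2 WHERE md5 = ? AND user_id = ?;
    INSERT INTO user_data_2 (md5, user_id, data) VALUES (?, ?, ?);

    -- Step 4: delete existing entries
    DELETE FROM user_data_2 WHERE md5 = ? AND user_id = ?;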

In all tests besides the last one, the results for the second table (the one keyed on the actual MD5) were much better than the results for the first (the one keyed on hash_id).

How is this possible? I assumed that using a char key, which is longer (32 bytes), would be worse than using a bigint (8 bytes). What am I missing here?

I use MySQL + InnoDB.

Thanks.

Best Answer

The size and makeup of an index, composite or not, do not make a lot of difference to performance.

However, the randomness of the accesses does. This combination is a performance killer:

  • "Random" index (or part of index): UUID / MD5 / SHA256 / etc.
  • Index (or table) is bigger than can be cached in RAM (in the buffer_pool).

Let's say the index (or table) is 4 times as big as the buffer pool. Then the 'next' lookup (either for a SELECT or for adding a row, hence updating the index) has a 3/4 chance of missing the buffer pool. That is, 75% of such operations will incur a disk read. If you are using an HDD, that is (as a Rule of Thumb) about 10ms each. (SSDs are faster, but still not free.)
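One way to check whether you are in that territory is to compare the buffer pool size with the table's data-plus-index size (the schema and table names here are placeholders):

    -- How much RAM the buffer pool has
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- Approximate data + index size of a table, in bytes
    SELECT table_name,
           data_length + index_length AS total_bytes
    FROM   information_schema.tables
    WHERE  table_schema = 'your_db'
      AND  table_name   = 'user_data_2';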

You can mitigate the problem by shrinking the table size. This can be done by shrinking datatypes. Don't use INT (4 bytes) for a Yes/No flag; use TINYINT (1 byte). Don't store the MD5 as CHAR(32) (32 bytes, or 96 if utf8); store it in a BINARY(16) column, using UNHEX() when storing and HEX() when fetching.
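For example (table and column names follow the option-2 sketch above; the literal MD5 is just a sample value):

    CREATE TABLE user_data_2 (
        md5      BINARY(16) NOT NULL,       -- 16 bytes instead of CHAR(32)
        user_id  BIGINT NOT NULL,
        data     VARCHAR(255),
        PRIMARY KEY (md5, user_id)
    ) ENGINE=InnoDB;

    -- Storing: convert the 32-char hex string to 16 bytes
    INSERT INTO user_data_2 (md5, user_id, data)
    VALUES (UNHEX('9e107d9d372bb6826bd81d3542a419d6'), 42, '...');

    -- Fetching: convert back to hex for display
    SELECT HEX(md5) AS md5_hex, user_id, data
    FROM   user_data_2
    WHERE  md5 = UNHEX('9e107d9d372bb6826bd81d3542a419d6');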

Most other types of indexes have some degree of "locality of reference". That is, two (or a thousand) consecutive references will be "near" each other, hence the caching of the buffer_pool can help avoid I/O, even if the table/index is much bigger than the buffer_pool.

There is more discussion elsewhere of why UUIDs suck, plus a partial remedy for Type-1 UUIDs, which are time-based and are the type that MySQL's UUID() generates.
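(One concrete form of that remedy, assuming MySQL 8.0 or later: UUID_TO_BIN() with the swap flag rearranges the time-based fields so that consecutive UUIDs land near each other in the index, and shrinks the value to 16 bytes at the same time.)

    -- MySQL 8.0+: store a Type-1 UUID as 16 bytes with the time fields moved to the front
    CREATE TABLE t (
        id  BINARY(16) NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    INSERT INTO t (id) VALUES (UUID_TO_BIN(UUID(), 1));   -- 1 = swap the time parts

    SELECT BIN_TO_UUID(id, 1) FROM t;                     -- convert back for display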

As for your benchmark...

  • In a 'real' table, there would be several other columns.
  • You did not increase the number of rows to the point where the lookups started incurring a significant number of disk misses (a sketch of such a setup follows this list).
  • At that point, the 2-column lookup table could stay cached, but the bigger 'real' table could not.
  • So, if there was "locality of reference" (such as looking only at 'recent' rows), the MD5 would be jumping all around, but the BIGINT would be hitting more cacheable stuff.
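A rough way to redo the test so it actually stresses the buffer pool (the filler column and row counts are arbitrary; keep growing until the table is several times innodb_buffer_pool_size, and use UNHEX(MD5(RAND())) instead if you switched the key to BINARY(16)):

    -- Add some bulk so each row is closer to a 'real' row
    ALTER TABLE user_data_2 ADD COLUMN filler CHAR(200) NOT NULL DEFAULT '';

    -- Each run doubles the row count with fresh random MD5s; repeat until the
    -- table's data + index size is well past the buffer pool
    INSERT INTO user_data_2 (md5, user_id, data, filler)
    SELECT MD5(RAND()), FLOOR(RAND() * 1000000), '', REPEAT('x', 200)
    FROM   user_data_2;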