Mysql – Hundreds of millions of rows: Index by composite BIGINT, VARCHAR, or CHAR of HASH

index-designindex-tuninginnodbmysql-5.7primary-key

Given the following:

Target platform: MySQL 5.7, using InnoDB.

Scenario: Storing hundreds of millions of email addresses (plus some properties not used in queries). All queries will be done by knowing the email address beforehand.

Proposed solution:

  1. SHA256 the email address.
  2. Shard "the email table" by taking the first 3 bytes of the ASCII SHA256, creating 4096 tables (from 0x000 to 0xFFF) that will act as buckets for the email addresses. This tries to avoid having one single huge table.

Question: Which of the following would be a good PK to use inside each one of those 4096 buckets in terms of performance (indexing as being more important than querying)? Use of disk space might not be that important upfront (unless there are some heavy arguments to take this into account, which I'm open to know and discuss, of course).

  1. VARCHAR(255) of the email address? This is of course the simplest.
  2. CHAR(64) of the ASCII SHA256 hash of the email address? Long shot: I'm considering that indexing and comparing fixed length strings (CHAR) is faster than variable length strings (VARCHAR).
  3. Split the SHA256 into 8 64bit integers, then create a composite PK of 8 BIGINT columns and index/query by those 8 BIGINT columns instead of using a VARCHAR/CHAR? Crazy idea: Perhaps using only 64bit integers for indexing and querying can provide a noticeable improvement in index and query performance (and also perhaps in disk access/storage). Although this is a composite PK of 8 BIGINT columns :\

Thanks in advance,

Best Answer

There is no advantage in any form of PARTITIONing based on what you have said.

The simplest, most obvious, index is also the best: Have the PRIMARY KEY be the email address (if it is unique), or start with the email address plus something to make it unique.

This is possible

id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT -- for uniqueness
...
PRIMARY KEY(email_addr, id)  -- id tacked on for uniqueness
INDEX(id)  -- to keep A_I happy.

What makes you think that "one huge table" is a problem? No, PARTITION BY HASH does not buy you anything.

You said "shard" and mentioned 4096 buckets. Technically that means you have 4096 servers to put parts of the data on. I assume you really meant PARTITION or (yuck) 4096 manually-handled tables, which is limited to a single MySQL instance.

If you really did mean spread across multiple servers, then, sure, the bottom chunk of the SHA256 is a good idea. MD5 is probably faster and just as good for this purpose. I would then use a dictionary lookup with 4096 elements to point to however many servers you currently have.