MySQL – Estimating query speed for a 1.5+ billion row table

database-design | memory | myisam | MySQL | performance | query-performance

I'm working on an audio fingerprinting problem where I need to
query a table that is very large in terms of row count (at least 1.5 billion rows) but relatively modest in size (23 GB), and retrieve about 50K to 100K rows in total across multiple queries (between 20 and 50 queries).

The table has 3 columns: a hash and two int values, with no constraints whatsoever. The hash column has a lot of collisions/duplicates. Here is the output of SHOW CREATE TABLE:

CREATE TABLE `fingerprints` (
  `hash` binary(10) NOT NULL,
  `int1` mediumint(8) unsigned NOT NULL,
  `int2` mediumint(8) unsigned NOT NULL,
  KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

The query is simple; here is an example:

select int1 ,int2 from fingerprints where hash in (UNHEX("1ff99335cce004f2765d"),UNHEX("14c4b93ed575982ed2e4"),UNHEX("41044b0cf21dc8ac8f9b"),UNHEX("a791403ca116b4da53dd"),UNHEX("d9f91514b900c25fa095"),UNHEX("3349f906deae6cd32883"),UNHEX("221c0e3e2bc243fb0fe5").... more here);

I've tried different hardware specs (on AWS, using only one machine/instance) and different my.cnf configurations, but saw no significant performance boost.

The target speed threshold for this operation (total time across all queries) is 5 seconds, but the best I've achieved on average is 3 seconds for a single query (so with 20 queries, the total operation time is about 1 minute).

Final note: when profiling the query, the SHOW PROFILE command shows that the slowest part is the "Sending data" state. The query becomes slower as the result set grows (e.g. retrieving 10K rows takes about 6 seconds, while retrieving 1,000 rows takes 2 seconds).
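For reference, the per-state timings above come from session profiling, along these lines (note that SHOW PROFILE is deprecated in recent MySQL versions in favor of the Performance Schema, but it still works; the two hashes shown are just placeholders from the example query):

```sql
-- Enable per-query profiling for this session
SET profiling = 1;

SELECT int1, int2 FROM fingerprints
WHERE hash IN (UNHEX('1ff99335cce004f2765d'), UNHEX('14c4b93ed575982ed2e4'));

-- List profiled queries, then break the most recent one down by state
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;
```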

Questions:

  • What speed can be expected for this query scenario on an SSD machine with enough RAM to hold the indexes? I have no experience working
    at this scale.
  • Do you have recommendations for a specific DB setup? Should I try the MySQL MEMORY engine? Is partitioning across distributed
    machines a necessity here? Should I switch to InnoDB?

My setup:

  • A read-only MyISAM table, compressed with myisampack and indexed on the WHERE (hash) column.
  • The index (the .MYI file) is fully loaded into RAM.
  • An SSD disk with limited IOPS (Amazon AWS). According to the AWS graphs, I'm sometimes hitting 700 IOPS.
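Preloading the index into RAM can be done with MySQL's LOAD INDEX INTO CACHE statement for MyISAM tables; the key buffer size below is an assumption to be replaced with a value larger than the table's actual Index_length:

```sql
-- Make the key buffer large enough to hold the whole .MYI file,
-- then preload the index sequentially (faster than on-demand loading)
SET GLOBAL key_buffer_size = 8 * 1024 * 1024 * 1024;  -- 8 GB, adjust to Index_length
LOAD INDEX INTO CACHE fingerprints;
```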

Edit:

SHOW INDEX output:

+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table        | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| fingerprints |          1 | hash     |            1 | hash        | A         |        NULL |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

EXPLAIN output (for the example query):


+----+-------------+--------------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+
| id | select_type | table        | partitions | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+--------------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | fingerprints | NULL       | range | hash          | hash | 10      | NULL | 4912 |   100.00 | Using index condition |
+----+-------------+--------------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+

Best Answer

(UNHEXing is not a significant part of the problem.)

The real problem is the randomness of the hashes. This leads to jumping around to lots of places on disk. Let's dissect the query.

  • The IN list is a list of values scattered throughout the INDEX(hash).
  • Each value is looked up by drilling down a BTree (found in the .MYI file) that is cached in MyISAM's key_buffer. What is the value of key_buffer_size? What is the result of `SHOW TABLE STATUS LIKE 'fingerprints'\G`?
  • If the index_size is greater than key_buffer_size, then many of the lookups will hit the disk.
  • At the leaf node of each BTree lookup will be a 5-byte (I think) "record number".
  • Now to look up the row -- this will be a random disk access (a seek, no BTree) into fingerprints.MYD at offset = 17 * record_number. (The records appear to be FIXED length: 17 bytes.)
  • Again, we are looking at a probable disk hit -- assuming that the RAM available for caching the data file is less than Data_length (see the TABLE STATUS).
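To see how the index and data sizes compare against the caches, you can check the numbers the dissection above relies on:

```sql
-- Data_length (.MYD size) and Index_length (.MYI size) for the table
SHOW TABLE STATUS LIKE 'fingerprints'\G

-- Current MyISAM index cache size, for comparison against Index_length
SHOW VARIABLES LIKE 'key_buffer_size';
```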

What to do?

Case 1: Data_length + Index_length < RAM size: Have key_buffer_size a little bigger than Index_length. Gradually both caches will fill with index and data, and the I/O will go away.

Case 2: That sum is slightly bigger than RAM: Pick one of the caches to make big enough.

Case 3: The sum is a lot bigger than RAM: You are stuck with lots of I/O until you get more RAM.

I suspect Data_length and Index_length are about equal. I would split the available RAM in half -- half for key_buffer_size, the rest left free for data caching (MyISAM relies on the OS filesystem cache for data blocks).
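On a hypothetical 32 GB machine, that split would look something like this (the size is illustrative, not a recommendation for your box):

```sql
-- Hypothetical 32 GB box: give roughly half the RAM to the MyISAM key
-- buffer; leave the other half unallocated so the OS filesystem cache
-- can hold .MYD data blocks
SET GLOBAL key_buffer_size = 16 * 1024 * 1024 * 1024;  -- 16 GB
```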

Here are 2 more ideas:

  • Rather than fetching the ints in a second step, have KEY(hash, int1, int2). This means that only the BTree lookup is needed; the data will be sitting in the leaf node. With this approach, you could set key_buffer_size to 'most' of the available RAM. That SELECT won't touch the data, only the index.

  • Switch to InnoDB. Its blocks are 16KB, not 1KB. This may make things faster, but the disk footprint will be 2-3 times as much. Again, use the 3-column index, but shrink key_buffer_size to 20M and raise innodb_buffer_pool_size to 70% of RAM.
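Both ideas, sketched in SQL. On a 1.5-billion-row table either ALTER will take a long time and substantial temporary disk space, and a myisampack-compressed table is read-only, so it must be uncompressed or rebuilt first; try this on a copy:

```sql
-- Idea 1: covering index -- the SELECT is then answered from the
-- index alone, with no second seek into the data file
ALTER TABLE fingerprints
  ADD KEY hash_covering (hash, int1, int2),
  DROP KEY hash;

-- Idea 2: switch engines (keep the 3-column index); with InnoDB the
-- buffer pool caches both index and data pages
ALTER TABLE fingerprints ENGINE = InnoDB;
```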

Other Notes:

  • "Sending data" does not tell you anything. Profiling, in general, is useless.

  • SSDs run a lot faster than HDDs.

  • You appear to be I/O bound.

  • Whether you are I/O-bound or not, the total query time is roughly proportional to the number of hashes being looked up. (This can be deduced from my dissection.)

  • MEMORY is not likely to be significantly faster or slower than MyISAM. And if your data needs to persist, then there is a hassle because MEMORY is volatile.

  • I predict that compression is useless, since you have only 6 bytes per row to compress. (And the hash, itself, is probably not compressible.)

  • If your provider is limiting the IOPS, that is a problem. If your index is fully cached (and not so big that it is unnecessarily eating RAM), then the IOPS are fetches of data blocks. The 3-column key would be about 70% larger; will a large enough key_buffer fit in RAM? If so, that approach might be optimal.