MySQL – Estimating query speed for a 1.5+ billion row table

database-design | memory | myisam | MySQL | performance | query-performance

I'm working on an audio fingerprinting problem where I need to
query a table that is very large in terms of row count (at least 1.5 billion rows) but relatively modest in size (23 GB), and retrieve about 50K to 100K rows in total across multiple queries (between 20 and 50 queries).

The table has 3 columns: a hash and two int values, with no constraints whatsoever. The hash column has a lot of collisions/duplicates. Here is the output of SHOW CREATE TABLE:

CREATE TABLE `fingerprints` (
  `hash` binary(10) NOT NULL,
  `int1` mediumint(8) unsigned NOT NULL,
  `int2` mediumint(8) unsigned NOT NULL,
  KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

The query is simple; here is an example:

select int1 ,int2 from fingerprints where hash in (UNHEX("1ff99335cce004f2765d"),UNHEX("14c4b93ed575982ed2e4"),UNHEX("41044b0cf21dc8ac8f9b"),UNHEX("a791403ca116b4da53dd"),UNHEX("d9f91514b900c25fa095"),UNHEX("3349f906deae6cd32883"),UNHEX("221c0e3e2bc243fb0fe5").... more here);

I've tried different hardware specs (on AWS, using only one machine/instance) and different my.cnf configurations, but saw no significant performance boost.

The target speed threshold for this operation (total time across all queries) is 5 seconds, but the best I've achieved on average is 3 seconds for a single query (so with 20 queries, the total operation time is about 1 minute).

Final note: when profiling the query, the SHOW PROFILE command shows that the slowest part is the "Sending data" state. The query becomes slower as the result set grows (e.g. retrieving 10K rows takes about 6 seconds, while retrieving 1,000 rows takes 2 seconds).
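For reference, the per-state timings above come from session profiling, along these lines (note that SHOW PROFILE is deprecated in recent MySQL versions in favor of the Performance Schema, but it still works; the two hashes shown are just placeholders from the example query):

```sql
-- Enable per-query profiling for this session
SET profiling = 1;

SELECT int1, int2 FROM fingerprints
WHERE hash IN (UNHEX('1ff99335cce004f2765d'), UNHEX('14c4b93ed575982ed2e4'));

-- List profiled queries, then break the most recent one down by state
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;
```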

Questions:

  • What speed can be expected for this query scenario on an SSD machine with enough RAM to hold the indexes? I have no experience working
    at this scale.
  • Do you have recommendations for a specific DB setup? Should I try the MySQL MEMORY engine? Is partitioning across distributed
    machines a necessity here? Should I switch to InnoDB?

My setup:

  • A read-only MyISAM table, compressed with myisampack and indexed on the WHERE (hash) column.
  • The index (the .MYI file) is fully loaded into RAM.
  • An SSD disk with limited IOPS (Amazon AWS). According to the AWS graphs, I'm sometimes hitting 700 IOPS.
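Preloading the index into RAM can be done with MySQL's LOAD INDEX INTO CACHE statement for MyISAM tables; the key buffer size below is an assumption to be replaced with a value larger than the table's actual Index_length:

```sql
-- Make the key buffer large enough to hold the whole .MYI file,
-- then preload the index sequentially (faster than on-demand loading)
SET GLOBAL key_buffer_size = 8 * 1024 * 1024 * 1024;  -- 8 GB, adjust to Index_length
LOAD INDEX INTO CACHE fingerprints;
```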

Edit:

SHOW INDEX output:

+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table        | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| fingerprints |          1 | hash     |            1 | hash        | A         |        NULL |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

EXPLAIN output (for the example query):


+----+-------------+--------------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+
| id | select_type | table        | partitions | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+--------------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | fingerprints | NULL       | range | hash          | hash | 10      | NULL | 4912 |   100.00 | Using index condition |
+----+-------------+--------------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+

Best Answer

(UNHEXing is not a significant part of the problem.)

The real problem is the randomness of the hashes. This leads to jumping around to lots of places on disk. Let's dissect the query.

  • The IN list is a list of values scattered throughout the INDEX(hash).
  • Each value is looked up by drilling down a BTree (found in the .MYI file) that is cached in MyISAM's key_buffer. What is the value of key_buffer_size? What is the result of `SHOW TABLE STATUS LIKE 'fingerprints'\G`?
  • If the index_size is greater than key_buffer_size, then many of the lookups will hit the disk.
  • At the leaf node of each BTree lookup will be a 5-byte (I think) "record number".
  • Now to look up the row -- this will be a random disk access (a seek, no BTree) into fingerprints.MYD at offset = 17 * record_number. (The records appear to be FIXED length: 17 bytes.)
  • Again, we are looking at a probable disk hit -- assuming that the RAM available for caching the data file is less than Data_length (see the TABLE STATUS).
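To see how the index and data sizes compare against the caches, you can check the numbers the dissection above relies on:

```sql
-- Data_length (.MYD size) and Index_length (.MYI size) for the table
SHOW TABLE STATUS LIKE 'fingerprints'\G

-- Current MyISAM index cache size, for comparison against Index_length
SHOW VARIABLES LIKE 'key_buffer_size';
```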

What to do?

Case 1: Data_length + Index_length < RAM size: Have key_buffer_size a little bigger than Index_length. Gradually both caches will fill with index and data, and the I/O will go away.

Case 2: That sum is slightly bigger than RAM: Pick one of the caches to make big enough.

Case 3: The sum is a lot bigger than RAM: You are stuck with lots of I/O until you get more RAM.

I suspect Data_length and Index_length are about equal. I would split the available RAM in half -- half for key_buffer_size, the rest left free for data caching (MyISAM relies on the OS filesystem cache for data blocks).
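On a hypothetical 32 GB machine, that split would look something like this (the size is illustrative, not a recommendation for your box):

```sql
-- Hypothetical 32 GB box: give roughly half the RAM to the MyISAM key
-- buffer; leave the other half unallocated so the OS filesystem cache
-- can hold .MYD data blocks
SET GLOBAL key_buffer_size = 16 * 1024 * 1024 * 1024;  -- 16 GB
```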

Here are 2 more ideas:

  • Rather than fetching the ints in a second step, have KEY(hash, int1, int2). This means that only the BTree lookup is needed; the data will be sitting in the leaf node. With this approach, you could set key_buffer_size to 'most' of the available RAM. That SELECT won't touch the data, only the index.

  • Switch to InnoDB. Its blocks are 16KB, not 1KB. This may make things faster, but the disk footprint will be 2-3 times as much. Again, use the 3-column index, but shrink key_buffer_size to 20M and raise innodb_buffer_pool_size to 70% of RAM.
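Both ideas, sketched in SQL. On a 1.5-billion-row table either ALTER will take a long time and substantial temporary disk space, and a myisampack-compressed table is read-only, so it must be uncompressed or rebuilt first; try this on a copy:

```sql
-- Idea 1: covering index -- the SELECT is then answered from the
-- index alone, with no second seek into the data file
ALTER TABLE fingerprints
  ADD KEY hash_covering (hash, int1, int2),
  DROP KEY hash;

-- Idea 2: switch engines (keep the 3-column index); with InnoDB the
-- buffer pool caches both index and data pages
ALTER TABLE fingerprints ENGINE = InnoDB;
```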

Other Notes:

  • "Sending data" does not tell you anything. Profiling, in general, is useless.

  • SSDs run a lot faster than HDDs.

  • You appear to be I/O bound.

  • Whether you are I/O-bound or not, the total query time is roughly proportional to the number of hashes being looked up. (This can be deduced from my dissection.)

  • MEMORY is not likely to be significantly faster or slower than MyISAM. And if your data needs to persist, then there is a hassle because MEMORY is volatile.

  • I predict that compression is useless, since you have only 6 bytes per row to compress. (And the hash, itself, is probably not compressible.)

  • If your provider is limiting the IOPS, that is a problem. If your index is fully cached (and not so big that it is unnecessarily eating RAM), then the IOPS are fetches of data blocks. The 3-column key would be about 70% larger; will a large enough key_buffer fit in RAM? If so, that approach might be optimal.