Mysql – Optimize speed on 400GB MyISAM table

Tags: mariadb, myisam, mysql

We have a MariaDB server with a single table holding image URL information. The table is about 400GB on disk and contains roughly 400M rows.

The table is partitioned into 1024 partitions.

All queries are similar to this one:

select * from container where id in (1234, 1235 ... );

Such a query usually takes about 2 seconds to execute.

Each row contains a single image URL, a title, and keywords.

key_buffer_size is set to 8GB.

This setup works well until we begin to insert additional rows. We tried normal inserts and also LOW_PRIORITY inserts; it is slow in both cases.

I wonder what else we can tweak in order to speed up the SELECTs.

Update:
The table has no indexes except the primary key on a BIGINT field, i.e. PRIMARY KEY (id).

Update 2:
Here is some more information:

Create Table

CREATE TABLE `container` (
  `id` bigint(20) NOT NULL,
  `data` blob NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=binary
/*!50100 PARTITION BY KEY (id)
PARTITIONS 1024 */

The data field stores standard JSON. It is UTF-8 text, but because of wrong encoding in the input data, we were forced to store it as a BLOB with the binary charset.

Size on disk

# du -h /usr/local/mysql/var/mydb/
371G    /usr/local/mysql/var/mydb/

Count(*)

> select count(*) from container;
+-----------+
| count(*)  |
+-----------+
| 409036295 |
+-----------+
1 row in set (0.04 sec)

my.ini

[mysqld]
server-id = 1

port=3306
socket=/tmp/mysql.sock
skip_name_resolve

open-files-limit=64000

#Flush every 15 min (900 sec)
set-variable = flush_time=900

#Max Clients
set-variable = max_connections=5050
set-variable = max_user_connections=5000
set-variable = back_log=50

set-variable = table_open_cache=1024
set-variable = table_definition_cache=1024

#INSERT While SELECT-ing. Default is 1 (1 = On if have no hole, 2 = On if have hole)
set-variable = concurrent_insert=2

#Interactive timeout 60 min (from console)
set-variable = interactive_timeout=3600

#non-interactive timeout 3 hours
set-variable = wait_timeout=10800

set-variable = key_buffer_size=8192M
set-variable = max_allowed_packet=5M
set-variable = sort_buffer_size=256M

set-variable = tmp_table_size=512M
set-variable = max_heap_table_size=64M

#all updates will wait for selects
set-variable = low_priority_updates=1

#preforking
set-variable = thread_cache_size=64

#----- SLOW QUERIES -----

set-variable = long_query_time=2
set-variable = log_slow_queries=mysql-slow.log

#----- CACHE -----

# SELECT SQL_CACHE * from x
set-variable = query_cache_type=0
set-variable = query_cache_limit=1M
set-variable = query_cache_size=128M

Best Answer

Assuming you are SELECTing only by the PRIMARY KEY, then the following will speed it up:

  • NO PARTITIONing. It only slows down such queries, especially with 1024 partitions.
  • Switch to InnoDB. The PRIMARY KEY is "clustered" with the data, which saves a disk hit on each row fetched. Shrink key_buffer_size to 50M and raise innodb_buffer_pool_size to 70% of available RAM. InnoDB also avoids MyISAM's table-level locks.
  • Compress the JSON text in the client; the datatype then needs to be BLOB (which it already is, if for the wrong reason).
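The client-side compression in the last bullet can be sketched as follows. This is a minimal illustration, assuming a Python client; pack_row/unpack_row are hypothetical helper names, not part of any driver API:

```python
import json
import zlib

def pack_row(doc: dict) -> bytes:
    """Serialize JSON and deflate it before it goes into the BLOB column.
    (Hypothetical helper; the column stays BLOB, as in the schema above.)"""
    return zlib.compress(json.dumps(doc).encode("utf-8"), 6)

def unpack_row(blob: bytes) -> dict:
    """Reverse of pack_row: inflate, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Round-trip check on a sample document
doc = {"url": "http://example.com/img.jpg", "title": "demo",
       "keywords": ["landscape", "sunset"]}
blob = pack_row(doc)
assert unpack_row(doc=None) if False else unpack_row(blob) == doc
```

JSON with repeated keys typically deflates to a third or less of its original size, which is where the 2-3x InnoDB expansion mentioned below gets won back.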

Note: InnoDB will expand the disk space by a factor of 2-3; the compression will get that space back. (Please test this with, say, a million rows. Experiment with the different ROW_FORMATs; I don't know which one will work best for your situation. And validate the SELECT speed and the lack of interference between reads and writes.)
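The conversion itself might look like the sketch below. Run it against a test copy first, as advised: rebuilding a 400GB table takes hours and needs roughly the table's size in free disk space. The ROW_FORMAT choice is the thing to benchmark:

```sql
-- Drop the 1024 KEY partitions first
ALTER TABLE container REMOVE PARTITIONING;

-- Convert to InnoDB; try DYNAMIC vs COMPRESSED on a sample
ALTER TABLE container ENGINE=InnoDB ROW_FORMAT=DYNAMIC;
```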

Do you really have 5000 simultaneous connections? They are probably stumbling over each other and slowing each other down.

You have half turned off the query cache; also set query_cache_size=0.
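In my.ini terms, fully disabling the query cache means setting both variables, so the 128M currently reserved is released as well:

```ini
query_cache_type = 0
query_cache_size = 0
```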