MySQL indexing on 11 million rows

indexMySQLmysql-5.5temporary-tables

I have a development and production server, both with different amounts of RAM. Development machine has 8G, while production has 64G.

I'm importing a large table from development to production, but before I started the process I dropped all the indexes to speed the import process, because I'm using stored procedure and with indexes it was very very slow.

On the development machine, while I was playing with creating/dropping indexes, I noticed that mysql used temporary space on the disk (default /tmp) for creating the indexes and it got to almost 8G.

The question is, since the production server has 64G of ram, will the mysql use the RAM for storing the temporary table while creating the indexes or will it use /tmp for the same thing?

Best Answer

What "import" mechanism was used? Some will automatically postpone the secondary-index creation for you.

I don't understand "I'm using stored procedure and with indexes it was very very slow". SPs being used for the import? Or indexes interfering with SPs? What?

Is /tmp on the same disk and in the same disk 'partition'? This is critical when it needs to "move" the result to the mysql directory.

You really need to upgrade -- 5.6 has some indexing improvements; 5.7 and 8.0 have more.

MySQL uses a combination of RAM and disk -- trying to minimize I/O. So, the question of "will it use RAM" is moot. It will "do what is optimal".

To create a secondary, non-UNIQUE, BTree index, MySQL probably

reads the entire data, writing the secondary keys and primary keys to a file
sorts that file (OS file sort is quite efficient)
reads the file and builds the BTree for the INDEX; this will involve writing to disk.

However. "read" and "write" and "file" are cached in based on InnoDB's whim and the OS's whim. Furthermore, step 3 may delay the actual write to disk of the BTree.

UNIQUE requires a check; FULLTEXT and SPATIAL were not available in InnoDB in 5.5. If you are using Engine=MyISAM, some of what I have said is incorrect.

A related question -- Is that OS sort faster than inserting the rows one at a time and building the index on the fly? The answer is "probably". On-the-fly requires keeping index entries in the in-RAM key_buffer or buffer_pool, plus flushing such as it goes along. That gets into another long discussion.

More

The main part of ALTER TABLE (at least on 5.5 and before) involves reading the entire table and (depending on the specific Alter) rewriting the data BTree and/or writing index BTree(s).

If tmpdir refers to a directory that is on the same disk partition as the main mysql/ directory then various ALTER TABLEs can "move" a temp file (which such is required) from tmpdir to mysql/databasename/ at the end of the Alter. The "move" is accomplished (at least on Linux) by a couple of quick directory tasks. If different disk partitions are needed, a physical "copy" is needed; this requires a big read and write.

A file sort is likely to be used as one step in building an index. How this is done is up to the OS. But it is equivalent to reading and writing the data being sorted (the index's BTree) one or more times (depending on number of rows in table). For a tiny table, the OS will effectively cache the effort in RAM. For a huge table multiple "sort-merge" passes will be required.

Sorting is "O(N*logN)". That is, if you double the number of items, it takes more than twice as long to complete the sort. (However, I/O dominates the time consumed, so one can estimate time very precisely.)

If the server is configured with multiple not-big-enough physical disks, the above should help with understanding that part of the Alter.

It is optimal (for MySQL) to give most of RAM to MySQL. But, when sorting is necessary, the OS needs some room, too. The tradeoff here is beyond my understanding of the details.

Related Solutions

MySQL Indexing – How to Optimize MySQL Setup for Faster Index Creation

For starters, I would not touch the buffer sizes just yet. The sizes youhave in the question are monstrously too big.

Here is another observation: You have BLOB data. Ouch, your temp table is going to eat space rather quickly. You could do somehting like this:

Create a 32GB RAM Disk called /var/tmpfs by adding this line to /etc/fstab

none                    /var/tmpfs              tmpfs   defaults,size=32g        1 2

Next, create a folder called /mysqltmp and mount the RAM disk on it

mkdir /mysqltmp
chown mysql:mysql /mysqltmp
mount /mysqltmp /var/tmpfs

Add this to my.cnf and restart mysql

[mysqld]
tmpdir=/mysqltmp

Now, any tmp table made via DDL lands in the RAM disk.

Here is yet another observation: Why not create a separate table that keeps the BLOB data away from the unique names?

CREATE TABLE `data_store_name` SELECT id,uniqname FROM `data_store` WHERE 1=2;
ALTER TABLE `data_store_name` ADD PRIMARY KEY (id);
ALTER TABLE `data_store_name` ADD UNIQUE KEY (uniqname);
ALTER TABLE `data_store_name` ADD INDEX name_id_ndx (uniqname,id);
INSERT INTO `data_store_name` SELECT id,uniqname FROM `data_store`;

This will prevent any moving around of BLOB data when indexing.

From here, you would have to always join data_store using its name like this:

SELECT
    A.uniqname,B.data
FROM
    (SELECT * FROM data_store_name WHERE uniqname = 'mydataname') A
    LEFT JOIN
    data_store B USING (id)
;

Making these changes will sidestep this whole mess of dealing with keycache, RAM disks, and tmp tables.

Give it a Try !!!

MySQL 5.5 Runs Out of Memory, Drops All Connections When Creating Many Databases

The first thing that comes to mind is the Server Model : db.m1.large

What limiting factors are placed on a MySQL RDS?

If you spin up an Amazon RDS instance of MySQL, you would subject yourself to whatever constraints are given. All models of MySQL Amazon RDS have the same major options but differ in only two aspects

max_connections
innodb_buffer_pool_size

Here is a Chart I posted Last Month

MODEL      max_connections innodb_buffer_pool_size
---------  --------------- -----------------------
t1.micro   34                326107136 (  311M)
m1-small   125              1179648000 ( 1125M,  1.097G)
m1-large   623              5882511360 ( 5610M,  5.479G)
m1-xlarge  1263            11922309120 (11370M, 11.103G)
m2-xlarge  1441            13605273600 (12975M, 12.671G)
m2-2xlarge 2900            27367833600 (26100M, 25.488G)
m2-4xlarge 5816            54892953600 (52350M, 51.123G)

I posted this in my other posts in the DBA StackExchange

You are using m1.large. Since the InnoDB Buffer Pool is 3/4 of the Instance. That's means you have 7.2G but only 1.8 is usable. That model is capable of having up to 623 connections. Each connection can consume memory because of

join buffers
sort buffers
read buffers
thread info

Amazon RDS is simply micromanaging resources. Since DB Connections can consume RAM, connections are probably being disallowed due to the lack of RAM needed.

SUGGESTIONS

Try reducing InnoDB log IO innodb_flush_log_at_trx_commit=0 during the mass creation
Make sure you are not doing large transactions during any automatic backups or snapshots
Try a bigger server model m1.xlarge or m2.xlarge

Best Answer

Related Solutions

MySQL Indexing – How to Optimize MySQL Setup for Faster Index Creation

MySQL 5.5 Runs Out of Memory, Drops All Connections When Creating Many Databases

SUGGESTIONS

Related Question