I'm going to give this my best guess, but I'll preface it by saying that we don't have intimate knowledge of your schema, and while I am knowledgeable about MySQL, I would not call myself an expert.
Index-organized tables
One problem you might be having is due to the way in which MySQL stores data, along with your UPSERT behavior. Using the InnoDB engine in MySQL, every table is an index-organized table. This means that the data payload is stored directly within the pages of the primary key index. In effect, the primary key index is also the table structure.
Now, in your case, you are inserting rows with a total of 42 columns. Without knowing the table structure, I am guessing that some of these 40 unknown columns are of variable length. So each time you update the data, there is likely to be quite a lot of data movement, because InnoDB has to rewrite the index and data pages in order to accommodate the new data.
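For illustration, this is the kind of statement I mean; the table and column names here are my assumptions, not taken from your real schema:

-- Hypothetical upsert (names are illustrative only). Every conflicting
-- row rewrites several variable-length columns, which forces InnoDB to
-- reorganise the clustered index pages that hold the row.
INSERT INTO wide_table (RecID, `DATE`, NAME, COL1, COL2)
VALUES (42, '2015-06-01', 'example', 'payload one', 'payload two')
ON DUPLICATE KEY UPDATE
    COL1 = VALUES(COL1),
    COL2 = VALUES(COL2);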
Change your schema?
So, based on this likely limitation you are seeing, I think one thing you should consider is a schema change. What you have now is essentially one big flat table. But you've stated that you receive only 10 of these 40 unknown columns at a time, so why don't you have at least 5 tables? That is (in loose pseudocode; make sure to add your FK relationships, and see the fleshed-out sketch after the pseudocode):
CREATE TABLE record (RecID, DATE, NAME);
CREATE TABLE column_set_1 (RecID,COL1,...,COL10);
CREATE TABLE column_set_2 (RecID,COL11,...,COL20);
CREATE TABLE column_set_3 (RecID,COL21,...,COL30);
CREATE TABLE column_set_4 (RecID,COL31,...,COL40);
This way, you are only performing inserts, and only into the relevant tables, without touching all of the other data.
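If it helps, here is a minimal concrete version of that sketch. The column types are my assumptions, since we don't know your schema; substitute whatever your data actually needs:

-- Types and sizes here are assumed, not taken from the real schema.
CREATE TABLE record (
    RecID   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `DATE`  DATE NOT NULL,
    NAME    VARCHAR(100) NOT NULL,
    PRIMARY KEY (RecID)
) ENGINE=InnoDB;

CREATE TABLE column_set_1 (
    RecID INT UNSIGNED NOT NULL,
    COL1  VARCHAR(255),
    -- ... COL2 through COL10 ...
    PRIMARY KEY (RecID),
    CONSTRAINT fk_cs1_record FOREIGN KEY (RecID) REFERENCES record (RecID)
) ENGINE=InnoDB;

-- column_set_2 through column_set_4 follow the same pattern.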
Table Partitioning
Also, because you have a DATE field, it may be helpful to partition your table. I can't say for sure whether this will have a strong positive impact, but I suspect it may. If you go for my suggested schema change, for example, I might choose to hash partition the 4 column_set_* tables based on RecID, so that the INSERT...ON DUPLICATE KEY UPDATE can make use of partition lock pruning; a sketch of that follows.
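As a rough sketch of what that would look like (types assumed as before, and note that MySQL does not allow foreign keys on partitioned tables, so you would have to drop the FK or enforce it in the application):

-- Hash partitioning one of the column-set tables by RecID.
-- MySQL does not support foreign keys on partitioned tables,
-- so the FK to record is omitted here.
CREATE TABLE column_set_1 (
    RecID INT UNSIGNED NOT NULL,
    COL1  VARCHAR(255),
    -- ... COL2 through COL10 ...
    PRIMARY KEY (RecID)
) ENGINE=InnoDB
PARTITION BY HASH (RecID)
PARTITIONS 8;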
Best of luck, and maybe someone with better MySQL expertise can correct any mistakes I've made.
This is a good question; working in a cloud environment can be tricky sometimes.
You can use the pg_dumpall -f dump.sql command, which dumps your entire cluster (all databases and roles) to a plain SQL file, in a way that lets you reconstruct it from scratch against another endpoint. In short, you restore it with psql -h endpoint-host.com.br -f dump.sql.
But to do that, you will need an EC2 instance with a reasonable amount of disk space (enough to fit your database dump). You will also need to install the PostgreSQL client tools (yum install postgresql94.x86_64) to be able to run the dump and restore commands.
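Putting the pieces together, a rough sketch of the whole sequence from that EC2 instance (the endpoint hostnames and the user are placeholders):

# Install the PostgreSQL client tools (Amazon Linux package name;
# it may differ on other distributions)
sudo yum install postgresql94.x86_64

# Dump the entire cluster from the source endpoint
pg_dumpall -h source-endpoint.com.br -U masteruser -f dump.sql

# Replay the dump against the target endpoint
psql -h endpoint-host.com.br -U masteruser -f dump.sql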
See examples at PG Dumpall DOC.
Remember that to keep the integrity of your data, it is recommended (and in some cases mandatory) that you shut down the systems that connect to the database during this maintenance window.
Also, if you need to speed things up, consider using pg_dump instead of pg_dumpall, taking advantage of its parallelism parameter (-j njobs), which sets the number of jobs involved in the process; for example, -j 8 will use up to 8 parallel jobs. By default both pg_dumpall and pg_dump use only one. Note that -j only works with pg_dump's directory output format (-Fd); pg_dumpall does not support it at all. The drawback of using pg_dump instead of pg_dumpall is that you will need to run the command once for each database you have, and also dump the ROLES (groups and users) separately, as shown in the sketch below.
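A rough sketch of that faster variant (database name, endpoints and user are placeholders):

# Roles (groups and users) are not included by pg_dump,
# so dump them separately
pg_dumpall -h source-endpoint.com.br -U masteruser --globals-only -f roles.sql

# Parallel dump of a single database (-j requires the directory format -Fd)
pg_dump -h source-endpoint.com.br -U masteruser -Fd -j 8 -f mydb.dump mydb

# Restore: roles first, then the database (create mydb on the target
# beforehand, or add -C); pg_restore can also run in parallel
psql -h endpoint-host.com.br -U masteruser -f roles.sql
pg_restore -h endpoint-host.com.br -U masteruser -j 8 -d mydb mydb.dump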
See examples at PG Dump DOC and PG Restore DOC.
Best Answer
No. Assuming that Amazon accounts IOPs accurately (after all, they are virtual machines), there are some abstractions between SQL and disk IO operations:

- SQL statements, which you can count with the com_select, com_insert, etc. counters on SHOW GLOBAL STATUS. Please note that if you have a cache like the query_cache, close to no disk operations will be done (because results will be returned directly from memory).
- Row-level storage engine operations, which you can count with the 'Handler_%' counters on SHOW GLOBAL STATUS.
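For example, you can compare those layers yourself with standard status counters:

-- Statement-level counters (one increment per statement, not per IO)
SHOW GLOBAL STATUS LIKE 'Com_select';
SHOW GLOBAL STATUS LIKE 'Com_insert';

-- Row-level storage engine operations
SHOW GLOBAL STATUS LIKE 'Handler_%';

-- Actual InnoDB disk accesses, for comparison
SHOW GLOBAL STATUS LIKE 'Innodb_data_reads';
SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';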
I have not gone into detail about MySQL and OS internals, but even without doing that you can see that things are not obvious. In fact, reducing IO, and knowing why there is so much of it in some cases, is one of the most important areas of database optimization.
An engine like InnoDB will require all of its pages to be written 3 or 4 times on insertion (once to the transaction log, once to the actual tablespace, once to the doublewrite buffer and, optionally, to the binary log), and even this is a simplification: it does not take into account index updates, metadata, statistics, etc. The best way to know how many IOPS you will need is to test on a particular setup; even some smaller fake tests will be more reliable than trying to theorise. Caching at both the OS and DB level will make the numbers dramatically different.
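If you want to check which of those write paths are in play on a given server, the controlling settings are ordinary server variables:

-- Is the doublewrite buffer enabled?
SHOW GLOBAL VARIABLES LIKE 'innodb_doublewrite';

-- How strictly the transaction (redo) log is flushed at commit
SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- Whether the binary log is enabled and how often it is synced
SHOW GLOBAL VARIABLES LIKE 'log_bin';
SHOW GLOBAL VARIABLES LIKE 'sync_binlog';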
For example, the other day I inserted tens of millions of records at 200,000 rows/s with LOAD DATA (no per-statement SQL overhead) because I had enough memory to write almost exclusively to the InnoDB buffer pool. It took several minutes for the disk to synchronise with the memory contents, though.
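For reference, the kind of statement I mean (the file path and table name here are illustrative only):

-- Bulk load straight from a file; avoids per-row INSERT statement overhead.
LOAD DATA INFILE '/tmp/records.csv'
INTO TABLE wide_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';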