MySQL Sharding vs Clustering – Speed Up Slow Queries

clusteringMySQLsharding

I am having a performance problem with my current single instance database as the volumes of data grow. The application I am working on receives inserts once a month, and the oldest month is deleted at the same time. The number of rows grows in the hundred millions for the most accessed table (one row per item per month per store) and is joined on a table storing the data specific to items regardless of time (like its label).

A normal query acts as follow: Do a aggregation either per month, store or article over a different amount of store/months and categories.

I was looking at some sharding possibilities that would allow me to keep the queries as they are (I can't afford to rewrite everything for now) and run the queries exactly as if I had only one server but in the end that server would run the query on multiple shards/replications to speed things up.

Can I expect any good speed increase on currenly slow queries? So far my understanding of cluster/sharding is mostly to help reduce loads from many read/writes but this is not a problem at all for me at this time since there is not a lot of queries going on (rarely multiple queries at the same time on the server instance).

Thank you

EDIT: I'd like to point out that today, I am coping with ok speeds by having the full table in memory so I've removed the disk read times, so simply storing the data on different disks wouldn't solve all…

Best Answer

  • Sharding requires a bunch of changes to the infrastructure, connections, etc. The product "Spider" makes most of that transparent.

  • Sharding may improve scaling and concurrency for simple queries. But for table scans and JOINs, it is likely to make performance worse.

  • Partitioning is excellent for "deleting old data". See my blog. Otherwise, the DELETE competes with other queries, making the whole system suffer when you do the purging. (For 3 years, I would PARTITION BY RANGE(TO_DAYS(...)) and hav 38 partitions.) DROP PARTITION for the old month is essentially instantaneous, regardless of size.

  • Loading data can be a bottleneck. Do you use LOAD DATA? That is probably the best. Second best would be batched INSERTs of 100-1000 rows per batch.

  • If you just need read scaling, then adding Slaves gives you unlimited read scaling.

  • If you need write scaling, a Galera-based cluster might help some, but Sharding is the only real solution.

  • A few Summary tables, as you allude to, is an excellent way to get performance from a Data Warehouse setup. I blog on that, too.

  • "I can't afford to rewrite everything for now" -- Any shortcut will take just as long. And then you will still have to "rewrite everything" to take the next step.

  • Manually splitting data onto different devices is not as good as RAID striping of all the drives into one large drive.

  • A RAID controller with Battery Backed Write Cache makes writes appear to be instantaneous, without losing data.

  • Normalization can be used to keep the tables smaller, hence cacheable. Are you doing a reasonable amount of that, and using reasonably small ids (eg, SMALLINT UNSIGNED instead of BIGINT where appropriate)?

  • Slow queries -- There are too many variables to make a general statement. Let's see some specifics.

I would be happy to discuss any of these further. Since you have not said what you really need, I don't know what to focus on.