Our InnoDB/MySQL 5.6 database is growing at a fairly steady pace, doubling in size every year, and is currently at 22 GB. This is sales data; we have a number of merchants, each generating millions of rows, all stored in one logical instance. The report queries are served by a few read replicas for load balancing and redundancy, with the queries distributed randomly.
The database is hosted in AWS/RDS. We outgrew the 4 GB instance, then the 8 GB one, and currently reside on a 16 GB node, which we expect to last until the end of the year. While AWS offers "memory optimized" instances, we feel this would just defer the problem rather than solve the bigger picture. Instead of scaling up, we would like to use a number of smaller instances, each serving merchant-specific data. We could partition the data the "hard way", where instance A holds only merchant A's data, instance B holds only merchant B's data, and so on. The biggest drawback is that in case of instance downtime we cannot simply route traffic to another instance, since there is no cross-account data sharing.
So I want to ask whether anyone can comment on "logical partitioning": all servers still hold all the data (disk space is not a problem), but queries related to one account are "sticky", i.e. routed to the same server (or server pool) to maximize cache hits. That way, if a DB instance, zone, or region goes down, any other instance elsewhere can pick up the work of the downed instance, since the same data is replicated everywhere.
In your opinion, would logical partitioning based on repeated queries to the same set of instances reduce the need for memory, assuming all the queries use indexes where the account ID is the first field in every composite key?
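To make the idea concrete, here is a minimal sketch (in Python; the replica names and function are made up for illustration) of what I mean by sticky routing: every replica holds all the data, each merchant's queries normally hash to one preferred replica, and they fall back to the next live replica when that one is down.

```python
import hashlib

# Hypothetical replica endpoints; every replica holds the full data set.
REPLICAS = ["replica-a.example", "replica-b.example", "replica-c.example"]

def pick_replica(merchant_id: int, live: set) -> str:
    """Route a merchant's queries to a 'sticky' replica to maximize
    buffer-pool cache hits, but fall back to any live replica if the
    preferred one is down (possible because all replicas hold all data)."""
    h = int(hashlib.sha256(str(merchant_id).encode()).hexdigest(), 16)
    n = len(REPLICAS)
    # Try the preferred replica first, then walk the ring.
    for i in range(n):
        candidate = REPLICAS[(h + i) % n]
        if candidate in live:
            return candidate
    raise RuntimeError("no live replicas")

# The same merchant always lands on the same replica while it is up:
primary = pick_replica(42, set(REPLICAS))
# ...and fails over to another replica when that one goes down:
fallback = pick_replica(42, set(REPLICAS) - {primary})
assert fallback != primary
```

The failover path is the point: a downed replica only costs the affected merchants their warm cache, not their availability.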
Best Answer
Good explanation. Good idea. I'll add my 2 cents...
"Sharding" would have a few merchants on one server (plus its replicas); a few other merchants on another shard, etc.
You are suggesting that it is practical to have all the data everywhere, but "shard" the accesses to minimize RAM requirements.
First, let me make a strong suggestion: have all `PRIMARY KEY`s be composite, starting with the `merchant_id`. This will cluster all the rows for a given merchant together. This is in addition to having most of the secondary indexes start with that same column.

If you currently have `PRIMARY KEY(id)`, then change to `PRIMARY KEY(merchant_id, id)` plus `INDEX(id)` (the latter is needed if `id` is `AUTO_INCREMENT`, since InnoDB requires the auto-increment column to be the first column of some index).

Reconsider the secondary indexes; you may not need as many, since the rows are already clustered by `merchant_id`.
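The suggested key change can be sketched in DDL (the `sales` table name is made up for illustration; only `merchant_id` and `id` come from the discussion):

```sql
-- Before (hypothetical):
--   PRIMARY KEY (id),
--   INDEX (merchant_id, ...)
--
-- After: cluster each merchant's rows together in the data BTree;
-- keep INDEX(id) so an AUTO_INCREMENT id is still the first column
-- of some index, as InnoDB requires. Doing all three steps in one
-- ALTER avoids ever leaving the auto-increment column unindexed.
ALTER TABLE sales
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (merchant_id, id),
    ADD INDEX (id);
```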
Granted, inserting not-for-this-shard merchant rows will still need to touch some blocks, but they will be transient and will not cost too much room in `innodb_buffer_pool_size`.

If you are pseudo-sharding the slaves, then if a slave dies, all the other slaves are ready to take over any merchant's requests. The only problem is that queries will be slow at first due to the lack of cached data. I see the 'instant' failover as more of a benefit than the temporary slowness is a detriment. Anyway, the cache will catch up pretty soon.
How much write traffic is there, versus read traffic?
A lot of writes: then you need to shard the Masters (and Slaves). But would that mean you are still doing all the writes on all Masters? That seems clumsy.
Mostly read traffic: (That's what I was commenting on above.)
Keep in mind that every Slave performs every write that happened on its Master. Corollary: You can't get write scaling without Sharding the Masters.
PK example
Using your specifics (from a Comment): you currently have an 8-byte `id BIGINT`? Then the change is

`PRIMARY KEY(id)` --> `PRIMARY KEY(merchant_id, id)`, plus `INDEX(id)`

That is, the disk footprint will be virtually identical. To elaborate: the data BTree is now ordered by `merchant_id, id`, so all the desired rows for a given merchant are adjacent to each other. The result: such `SELECT`s will be faster (maybe 2x, but there are a lot of variables).

Back to the "old wives' tale" that you should keep the PK "small". Yeah, that is a general rule, but I have just explained an exception.
If you have several more secondary keys on the same table, those secondary keys will be bulkier, since each secondary index entry implicitly includes the PK columns. But you want to skimp on RAM and don't need to worry about disk, so again, what I suggest fits well.