MySQL – Which of the following data duplication options across shards is recommended?

MySQL

The book High Performance MySQL suggests that, when sharding a blog application, one may want to store comment data on two shards: the shard of the person posting the comment, and the shard where the post is stored.

This raises the question of how to duplicate this data reliably. Which of the following data duplication options across shards is recommended?

Option 1: Make two separate inserts from the PHP script.
Pros: a) The logic lives in the application layer.
Cons: a) The user has to wait for two inserts. b) This logic will need to be duplicated in every client that inserts similar data.
Conclusion: Seems reasonable.

Option 2: Create federated tables and use a trigger to insert the duplicate.
Pros: a) The application layer doesn't need to worry about multiple inserts.
Cons: a) Every shard needs a federated connection to every other shard. b) Federation will work between machines on a LAN, but what about two different sites? c) What happens if the connection to the federated server fails?
Conclusion: Doesn't seem like a sound idea.

Option 3: Use messaging, such as RabbitMQ.
Pros: a) Different clients can insert data in one place, and all subscribers can consume the insert.
Cons: a) Complex. b) Hosting the messaging server and its clients may add overhead. c) I am not sure how it would work with a look-up service that locates the appropriate shards.
Conclusion: Not sure.

Option 4: Your suggestion?

I will greatly appreciate your help.

Best Answer

As you point out, having triggers between the various shards is silly; the whole reason for sharding is independent database operations. So you can throw it out right away.

Updating both tables at the same time is the approach with the fewest moving parts. Over the long term, it will be the most maintainable. And it will be the easiest to debug if something goes wrong.
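As a rough illustration of the two-write approach, here is a minimal sketch. It assumes shards are chosen by user id with a simple modulo rule, and it uses in-memory SQLite databases as stand-ins for the MySQL shards. The names make_shard, shard_for and add_comment, and the table layout, are hypothetical and not taken from the book or the question.

```python
import sqlite3

def make_shard():
    # Stand-in for one MySQL shard (assumption: an in-memory SQLite database).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        " comment_id TEXT PRIMARY KEY,"
        " post_id INTEGER, user_id INTEGER, body TEXT)"
    )
    return conn

SHARDS = [make_shard(), make_shard()]

def shard_for(user_id):
    # Assumed lookup rule: modulo on the user id. A real system might
    # use a directory service instead.
    return SHARDS[user_id % len(SHARDS)]

def add_comment(comment_id, post_id, post_owner_id, commenter_id, body):
    """Write the comment to the post owner's shard, then to the commenter's shard."""
    row = (comment_id, post_id, commenter_id, body)
    targets = [shard_for(post_owner_id)]
    commenter_shard = shard_for(commenter_id)
    if commenter_shard is not targets[0]:
        targets.append(commenter_shard)
    for conn in targets:
        # "with conn" commits on success and rolls back on an exception.
        # OR IGNORE makes a retry after a partial failure harmless.
        with conn:
            conn.execute("INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?)", row)
    return len(targets)
```

Because the comment id is generated by the caller and the insert ignores duplicates, a request that fails after the first write can simply be retried in full.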

But if response time is important, then you might think of some sort of messaging approach: update the comments-by-entry table, and queue a message to update the comments-by-user table. If it takes an hour for that message to be processed -- or if it gets lost in a system crash -- no big deal, you can always recover. By no means should you use a messaging approach to update both tables.
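The hybrid described above might look roughly like this. It is a sketch under stated assumptions: queue.Queue stands in for a real broker such as RabbitMQ, two in-memory SQLite databases stand in for the shards, and post_comment, drain_queue and repair are hypothetical names. The comments-by-entry write is synchronous. Only the comments-by-user copy goes through the queue.

```python
import queue
import sqlite3

def make_shard():
    # Stand-in for one MySQL shard (assumption: an in-memory SQLite database).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        " comment_id TEXT PRIMARY KEY,"
        " post_id INTEGER, user_id INTEGER, body TEXT)"
    )
    return conn

post_shard = make_shard()   # holds comments-by-entry (written synchronously)
user_shard = make_shard()   # holds comments-by-user (written later)
pending = queue.Queue()     # stand-in for a message broker

INSERT = "INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?)"

def post_comment(comment_id, post_id, user_id, body):
    """Synchronous write to the post's shard, then enqueue the duplicate."""
    row = (comment_id, post_id, user_id, body)
    with post_shard:
        post_shard.execute(INSERT, row)
    pending.put(row)

def drain_queue():
    """Worker step: apply every queued row to the commenter's shard."""
    applied = 0
    while True:
        try:
            row = pending.get_nowait()
        except queue.Empty:
            return applied
        with user_shard:
            user_shard.execute(INSERT, row)
        applied += 1

def repair():
    """Recovery step: recopy rows from the synchronously written shard."""
    rows = post_shard.execute(
        "SELECT comment_id, post_id, user_id, body FROM comments"
    ).fetchall()
    with user_shard:
        user_shard.executemany(INSERT, rows)
    return len(rows)
```

The recovery step works only because one copy is always written directly. If both copies went through the queue, a lost message would leave nothing to rebuild from.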

Answer by: @kdgregory Link: https://softwareengineering.stackexchange.com/a/134607/41398