MySQL – Sync tables from Host1 to Host2 with over 50 million rows

load balancing, MySQL, replication

I apologize in advance for using "sync" in the title; that may be the wrong term. I am using MySQL, but I can switch to PostgreSQL or another open-source platform.

I have a service where, when people subscribe, we receive a lot of data from them: around 60,000 rows that get split across multiple tables via an HTTP endpoint processed by PHP. After the initial sync, we only receive incremental updates.

Looking at server load, we became concerned that it would become an issue as we added more and more customers, so we set up a separate server that handles all the HTTP requests, parses the data, and places it in the right tables. As load increases, we can then section off customers to other identical servers.

The question is: what should I do to effectively replicate the data from those finished tables over to the production server (same network)? I could set up triggers, but that would result in tens of thousands of individual connection requests, one for each insert.

If I were to set up mirroring / master-slave replication, I would want it to ignore all existing data, because I wouldn't want it to repeatedly compare rows that have already been sent; that could grow to 50 million rows quickly.

It would be more like this:

-- Rough pseudocode; the "production." / "workhorse." prefixes stand for the
-- two servers, not real schema names.
SELECT @rowcount := COUNT(*) FROM production.database.table1;
INSERT INTO production.database.table1
  SELECT * FROM workhorse.database.table1 WHERE `id` > @rowcount;

Something this simple could be run once a minute or so, I would think, and it would cut down on individual connections. I realize this is messy, since COUNT and id aren't guaranteed to stay in step, but it gets the idea across. Also, I don't know a way to run the query against a remote server.
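For what it's worth, one way to reach a remote table from SQL is MySQL's FEDERATED storage engine (it has to be enabled on the production server). The host, credentials, and table definition below are placeholders, a sketch rather than a drop-in solution:

-- On the production server: a FEDERATED table that proxies the remote table
-- on the workhorse (requires the federated engine to be enabled in my.cnf).
CREATE TABLE workhorse_table1 (
  id      INT UNSIGNED NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id)
) ENGINE=FEDERATED
  CONNECTION='mysql://repl_user:secret@workhorse-host:3306/appdb/table1';

-- The incremental copy then runs entirely on the production server,
-- keyed on the highest id already copied rather than a row count.
SELECT COALESCE(MAX(id), 0) INTO @last_id FROM table1;
INSERT INTO table1
  SELECT * FROM workhorse_table1 WHERE id > @last_id;

Keying on MAX(id) instead of COUNT(*) also avoids the two drifting apart if rows are ever deleted.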

Are there any tools/commands that would make quick work of this?

Best Answer

Master-slave replication (and various other forms of replication) keeps two (or more) hosts in sync continually. No ETL process is needed.
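To make that concrete, here is a minimal sketch of classic binary-log replication, assuming the workhorse acts as the source and production as the replica, with a replication user already created on the workhorse; host names, credentials, and log coordinates are placeholders. The replica only applies changes recorded in the binary log, so existing rows are never re-compared; you seed production once with a dump and then point it at the corresponding log position.

-- Run on the production server (the replica). Coordinates come from
-- SHOW MASTER STATUS on the workhorse at the time of the initial dump.
CHANGE MASTER TO
  MASTER_HOST     = 'workhorse-host',
  MASTER_USER     = 'repl_user',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000123',
  MASTER_LOG_POS  = 4;

-- Optionally replicate only the finished tables (MySQL 5.7+ filter syntax).
CHANGE REPLICATION FILTER REPLICATE_DO_TABLE = (appdb.table1, appdb.table2);

START SLAVE;
SHOW SLAVE STATUS\G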

Splitting the client processes and MySQL into separate servers is a good first step.

Putting different customers on different "shards" provides 'infinite' scalability in that dimension. You would need some form of proxy or home-grown directing to get from a client to the necessary server.
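One simple form of home-grown directing is a small routing table that the PHP front end consults before opening a connection; the table and column names here are made up for illustration:

-- Hypothetical routing table mapping each customer to its shard.
CREATE TABLE customer_shard (
  customer_id INT UNSIGNED NOT NULL PRIMARY KEY,
  shard_host  VARCHAR(64)  NOT NULL   -- e.g. 'workhorse-02.internal'
);

-- The front end looks up the shard, then connects to that host.
SELECT shard_host FROM customer_shard WHERE customer_id = 42;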

Ingesting 60K rows is pretty trivial unless, of course, the data format is always something different. At that point, I might contract a programmer who knows Perl (or another suitable tool) to crank out ingestion scripts. (Years ago, I did exactly that with news feeds.)

Does 50M rows at roughly 60K per customer mean you have on the order of 1,000 customers?

I would suggest using sharding to deal with excessive write traffic and disk space, and slaves (read replicas) for excessive read traffic.

It is a never-ending pain to try to perfect "sync" scripts.