SUGGESTION #1 : Use Distribution Masters
A Distribution Master is a MySQL slave with log-bin and log-slave-updates enabled that contains only tables using the BLACKHOLE storage engine. You can apply replicate-do-db options to the Distribution Master so that the binary logs it writes contain only the DB schema(s) you want binlogged. In this way you reduce the size of the outgoing binlogs from the Distribution Master.
You can set up a Distribution Master as follows:
- 1) mysqldump your database(s) using the --no-data option to generate a schema-only dump.
- 2) Load the schema-only dump into the Distribution Master.
- 3) Convert every table in the Distribution Master to the BLACKHOLE storage engine.
- 4) Set up replication to the Distribution Master from a master with real data.
- 5) Add replicate-do-db option(s) to /etc/my.cnf on the Distribution Master.
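As a rough sketch, the my.cnf of a Distribution Master could look like this (the server-id and database names are placeholders, not values from this setup):

```ini
# /etc/my.cnf on the Distribution Master (illustrative values)
[mysqld]
server-id=10
log-bin=mysql-bin
log-slave-updates
# Only these schemas end up in the Distribution Master's binlogs
replicate-do-db=db1
replicate-do-db=db2
# Optional: makes new tables default to BLACKHOLE on this server
default-storage-engine=BLACKHOLE
```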
For steps 2 and 3 you could also edit the schema-only dump and replace ENGINE=MyISAM and ENGINE=InnoDB with ENGINE=BLACKHOLE and then load that edited schema-only dump into the Distribution Master.
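That edit can be scripted with sed, since mysqldump writes the engine as `ENGINE=InnoDB` or `ENGINE=MyISAM` in each CREATE TABLE statement (the filenames here are placeholders):

```shell
# Rewrite the storage engine in a schema-only dump before loading it
# into the Distribution Master.
sed -e 's/ENGINE=MyISAM/ENGINE=BLACKHOLE/g' \
    -e 's/ENGINE=InnoDB/ENGINE=BLACKHOLE/g' \
    schema_only_dump.sql > blackhole_schema.sql
```

Then load blackhole_schema.sql into the Distribution Master instead of the original dump.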
In step 3 only, if you want to script the conversion of all MyISAM and InnoDB tables to BLACKHOLE in the Distribution Master, run the following query and output it to a text file:
mysql -h... -u... -p... -A --skip-column-names -e"SELECT CONCAT('ALTER TABLE ',table_schema,'.',table_name,' ENGINE=BLACKHOLE;') BlackholeConversion FROM information_schema.tables WHERE table_schema NOT IN ('information_schema','mysql') AND engine <> 'BLACKHOLE'" > BlackholeMaker.sql
An added bonus to scripting the conversion of tables to the BLACKHOLE storage engine is that MEMORY storage engine tables are converted as well. While MEMORY tables do not take up disk space for data storage, they do take up memory. Converting MEMORY tables to BLACKHOLE keeps memory in the Distribution Master uncluttered.
As long as you do not send any DDL into the Distribution Master, you can transmit any DML (INSERT,UPDATE,DELETE) you so desire before letting clients replicate just the DB info they want.
I already wrote a post on another StackExchange site that discusses using a Distribution Master.
SUGGESTION #2 : Use Smaller Binary Logs and Relay Logs
If you set max_binlog_size to something ridiculously small, then binlogs can be collected and shipped out in smaller chunks. There is also a separate option to set the size of relay logs, max_relay_log_size. If max_relay_log_size = 0, it will default to whatever max_binlog_size is set to.
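For example, a my.cnf fragment along these lines would rotate both log types in small chunks (the 4M value is purely illustrative; the server default for max_binlog_size is 1G):

```ini
[mysqld]
# Rotate binary logs at 4M instead of the 1G default
max_binlog_size=4M
# 0 (the default) means relay logs also rotate at max_binlog_size
max_relay_log_size=0
```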
SUGGESTION #3 : Use Semisynchronous Replication (MySQL 5.5 only)
Setup your main database and multiple Distribution Masters as MySQL 5.5. Enable Semisynchronous Replication so that the main database can quickly ship binlogs to the Distribution Master. If ALL your slaves are Distribution Masters, you may not need Semisynchronous Replication or MySQL 5.5. If any of the slaves, other than Distribution Masters, have real data for reporting, high availability, passive standby or backup purposes, then go with MySQL 5.5 in conjunction with Semisynchronous Replication.
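Enabling Semisynchronous Replication in MySQL 5.5 is done via the bundled plugins, roughly as follows (plugin filenames shown are the Linux .so names; Windows uses .dll):

```sql
-- On the main database (master side):
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;

-- On each Distribution Master (slave side):
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
-- Restart the I/O thread so the slave registers as semisynchronous:
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
```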
SUGGESTION #4 : Use Statement-Based Binary Logging NOT Row-Based
If an SQL statement updates multiple rows in a table, Statement-Based Binary Logging (SBBL) stores only the SQL statement. The same SQL statement under Row-Based Binary Logging (RBBL) will actually record the change for every affected row. This makes it obvious that transmitting SQL statements with SBBL saves binary log space compared to RBBL.
Another problem is using replicate-do-db when table names have the database name prepended. This cannot be good for a slave, especially for a Distribution Master. Therefore, make sure all DML does not have a database name and a period in front of any table names.
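To illustrate (mydb and mytable are hypothetical names), replicate-do-db under statement-based logging filters on the default database set by USE, so only the first form below is filtered reliably:

```sql
-- Filtered correctly by replicate-do-db=mydb under statement-based logging:
USE mydb;
UPDATE mytable SET col = 1 WHERE id = 10;

-- Risky: the statement qualifies the table directly and does not rely on
-- the default database, so replicate-do-db may not filter it as expected:
UPDATE mydb.mytable SET col = 1 WHERE id = 10;
```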
OBSERVATION #1
You mentioned "Ask developers to put all the temporary tables into a separate database".
If your developers are using CREATE TEMPORARY TABLE commands to create temporary tables, they need to use CREATE TABLE instead. Here is why:
With MySQL Replication processing a temporary table, this is what occurs:
- 1) Master runs CREATE TEMPORARY TABLE
- 2) The command is written to the binary log
- 3) Replication copies it over to the Slave's relay logs via the I/O thread
- 4) Slave SQL thread runs CREATE TEMPORARY TABLE
- 5) Slave processes data with that temp table
Once in a while, someone may run STOP SLAVE; to take a backup. If STOP SLAVE; is issued just after step 4, the temp table just created disappears, and so does its data. When you then run START SLAVE;, replication breaks instantly, complaining that the table does not exist. This is normal: when a DB connection terminates, deliberately or accidentally, all temp tables opened with CREATE TEMPORARY TABLE in that session are dropped. Running STOP SLAVE; kills the SQL thread that was holding the temp table open.
The only workaround for this is to create the table using CREATE TABLE instead of CREATE TEMPORARY TABLE. When you run STOP SLAVE;, a table created normally does not disappear.
I have seen this happen maybe 10 times in my DBA career. Fixing it by using the binary logs to find the names of the temp tables, creating those tables with CREATE TABLE, and then starting replication back up was the only maintenance possible short of brute-force copying the master.
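The workaround pattern looks like this (tmp_work and its columns are hypothetical):

```sql
-- Instead of a session-scoped temp table...
-- CREATE TEMPORARY TABLE tmp_work (id INT, val INT);
-- ...create a real table that survives STOP SLAVE; / START SLAVE;
CREATE TABLE tmp_work (id INT, val INT) ENGINE=MyISAM;

-- ... use tmp_work here ...

-- and drop it explicitly when the work is done:
DROP TABLE tmp_work;
```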
OBSERVATION #2
mk-table-sync only works on tables with primary keys and/or unique keys. It works maybe 99% of the time. I have seen instances where the checksums of a table on the master and slave were different; I would run mk-table-sync, and there were still differences. (Of course, I was running mk-table-sync in circular replication with 3 masters, which can be a little dangerous. Using it in a Master/Slave setup is far more stable.)
OBSERVATION #3
You mentioned "There are some unsafe queries. Does this cause problems with MIXED-based replication?"
It depends. The most popular unsafe query is any UPDATE or DELETE that uses ORDER BY ... LIMIT. With SBR, this could cause MySQL to UPDATE or DELETE rows in a table on the Slave in a different order than on the Master. With RBR, I believe the exact changes to each row are more identifiable to UPDATE or DELETE on the Slave.
SOLUTION : Avoid using unsafe queries. Then, you will not worry !!!
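For example (mylog and its id primary key are hypothetical), the unsafe form can be rewritten so it is deterministic on both Master and Slave:

```sql
-- Unsafe under SBR: row order is not guaranteed to match on the slave
DELETE FROM mylog ORDER BY id LIMIT 1000;

-- Safer: pin the statement to an explicit primary-key boundary first
-- (the derived table x avoids the "can't modify the table being selected" error)
DELETE FROM mylog
WHERE id <= (SELECT id FROM (SELECT id FROM mylog ORDER BY id LIMIT 999,1) x);
```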
OBSERVATION #4
I just read your second link. ROFL !!! I am familiar with the poster of the answer.
Best Answer
You can't do that. Replication is intended to replicate as much as possible to keep the slaves as accurate copies of the master(s).
You might want to replicate to another machine so that you have a copy to run the copy-to-archive process on, so that the process does not impact the application's performance while it runs. That is optional: it is not necessary if the process doesn't impact live performance much, or if you have a time window in which you don't care if it does (e.g. overnight, if your application is predominantly used in one timezone and has quiet periods while most users are sleeping).
The most efficient way to update the archive/reporting database from the production one (either the main DB or a read-only replica), and clean up production once this is done, depends a lot on the structure of your data.