MySQL – Binlog Limitations

Tags: binlog, mysql, mysqlbinlog

I have a MySQL instance running on AWS, with around 5000 inserts per second. Any idea what the performance impact will be if I use binlog (row) and a binlog tailer?

Check this link

From my understanding, a binlog tailer polls the MySQL binlog periodically in order to make a 'realtime' data connection possible. The binlog tailer runs in NodeJS.

The point is, I have to use MySQL and I want to use Meteor to get the data in realtime to my clients. Hence my idea of using this binlog tailer.

Because the MySQL db is going to be filled quite intensively (5000 inserts per second), I want to know at which point the binlog/binlog tailer runs into severe performance issues.

Grts, Tom

Best Answer

I have developed software with similar functionality: the ability to use the MySQL replication stream (the binary log, or binlog) to capture events in near real time in response to inserts/updates/deletes in the database.

Here are some observations I've made regarding performance. Luckily, the potential hot spots are largely independent of each other.

Since I was unfamiliar with the Node package you cited and have only given its code a cursory review, I will assume that it is not actually "tailing" the binlog by polling, but is instead emulating a slave/replica server: it connects to the master and requests the replication stream.
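
To make that concrete, here is a minimal sketch of the pseudo-slave approach, assuming the `zongji` npm package (one commonly used Node binlog client; the package you cited may be different, and the host and credentials below are placeholders). The account it connects with typically needs the REPLICATION SLAVE, REPLICATION CLIENT, and SELECT privileges.

```typescript
// Sketch only: zongji registers itself with the master as a replica and
// receives pushed binlog events; it does not poll.
const ZongJi = require('zongji'); // the package ships without TypeScript types

const zongji = new ZongJi({
  host: '127.0.0.1',    // master (or an intermediate replica, see below)
  user: 'repl_reader',  // placeholder account with replication privileges
  password: 'secret',
});

zongji.on('binlog', (evt: any) => {
  // With binlog_format = ROW, write/update/delete events carry the row images.
  evt.dump();
});

// Subscribe only to the row-level events we care about.
zongji.start({
  includeEvents: ['tablemap', 'writerows', 'updaterows', 'deleterows'],
});
```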

The first potential bottleneck is the master's ability to write the amount of binlog data required (I/O throughput is the primary constraint). If your master is already logging in ROW format, then this problem is already solved. If not, switch your binlog format and observe the impact. I prefer ROW format anyway, because it is very useful for data recovery when queries go bad or the application does something to the data it shouldn't have: it is possible (using 3rd-party tools) to capture what happened and reverse it, since in the default configuration, when a delete occurs (for example), the deleted data is actually written to the binary log.
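
A quick way to verify the current format before changing anything, sketched here with the `mysql2` driver (an assumption on my part; any client works). Note the commented SET GLOBAL is illustrative: it needs the SUPER (or SYSTEM_VARIABLES_ADMIN) privilege, only affects new sessions, and the durable change belongs in my.cnf.

```typescript
import { createConnection } from 'mysql2/promise';

async function checkBinlogFormat(): Promise<void> {
  const conn = await createConnection({
    host: '127.0.0.1', // placeholder
    user: 'admin',     // placeholder
    password: 'secret',
  });

  // Expect Value: 'ROW' if the master is already writing row-format binlogs.
  const [rows] = await conn.query("SHOW VARIABLES LIKE 'binlog_format'");
  console.log(rows);

  // Illustrative only: affects new sessions; persist it in my.cnf instead.
  // await conn.query("SET GLOBAL binlog_format = 'ROW'");

  await conn.end();
}

checkBinlogFormat().catch(console.error);
```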

The next point of resource consumption is the slave connection made by such a tool to the master, where the master pushes the data. A common misconception is that a slave "polls" the master. In fact, the slave initiates the connection, but the master pushes the data. This is actually a load on the master that has very little impact on performance when the number of slaves connected is small (say, 5 or fewer). This load can be eliminated from the master entirely by connecting the "binlog tailer" not to the master, but to an existing slave of the master, with log_slave_updates configured.
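
If you do point the tailer at an existing replica instead of the master, that replica has to be writing its own binlog. A small hedged check (again assuming `mysql2`, with placeholder credentials; both variables are set in my.cnf and require a restart to change):

```typescript
import { createConnection } from 'mysql2/promise';

// Returns true if the replica both writes a binlog (log_bin) and re-logs the
// events it applies from its master (log_slave_updates), so a tailer can
// read the full change stream from it instead of from the master.
async function replicaCanFeedTailer(host: string): Promise<boolean> {
  const conn = await createConnection({ host, user: 'admin', password: 'secret' });
  const [rows] = await conn.query(
    "SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'log_slave_updates')"
  );
  await conn.end();
  return (rows as any[]).every((row) => row.Value === 'ON');
}
```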

Transport of the data from the master to the pseudo-slave can consume significant network bandwidth, so your external utility should ideally support the MySQL client/server compression protocol to reduce it. Enabling this capability can achieve compression ratios on the order of 10:1, depending on the payload.
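
For comparison, a real replica requests the compressed transport via the slave_compressed_protocol variable; whether your Node tailer can do the same depends on its driver implementing the compressed client/server protocol, so check that package's documentation. A sketch of the replica-side setting (mysql2 again, placeholder credentials):

```typescript
import { createConnection } from 'mysql2/promise';

async function enableCompressedReplication(replicaHost: string): Promise<void> {
  const conn = await createConnection({
    host: replicaHost,
    user: 'admin',     // placeholder
    password: 'secret',
  });

  // Dynamic variable; it takes effect the next time the replica's I/O thread
  // (re)connects to its master.
  await conn.query('SET GLOBAL slave_compressed_protocol = 1');

  const [rows] = await conn.query(
    "SHOW VARIABLES LIKE 'slave_compressed_protocol'"
  );
  console.log(rows); // expect Value: 'ON'

  await conn.end();
}
```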

The final pain point is the external utility itself. The MySQL binlog format is a very tightly packed binary format (hence "binary log") that must be parsed and decoded. The efficiency with which the external utility can unpack and manipulate this data stream determines how close to real time the detected events can be emitted: inefficient code will cause your decoded event stream to lag further and further behind the master, though this factor has no performance impact on the master server itself.
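
One cheap way to see whether your consumer is keeping up is to compare the timestamp in each binlog event header against the wall clock. The sketch below assumes the event object exposes that timestamp in milliseconds (zongji events carry one, but the field name and units may differ in other packages, so verify before relying on it):

```typescript
// Logs a warning whenever the decoded event stream has fallen noticeably
// behind the master. Binlog header timestamps have one-second resolution,
// so small values here are just noise.
function reportLag(evt: { timestamp: number }): void {
  const lagMs = Date.now() - evt.timestamp;
  if (lagMs > 5_000) {
    console.warn(`binlog consumer is ~${Math.round(lagMs / 1000)}s behind the master`);
  }
}

// e.g. zongji.on('binlog', (evt: any) => { reportLag(evt); /* decode rows... */ });
```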

In short, if your master can handle the workload of generating row-format binlogs for the volume of traffic you expect, the rest of the potential issues are still potential issues, but they should have no meaningful performance impact on the master server itself.