You say you are using pt-table-checksum 2.0.1. I would recommend updating to 2.1, as there are many improvements in the tool.
Next, let me address your test. You say the slave was not updated after the first or second commands that you ran. The second command looks to be trying to connect directly to the slave. pt-table-checksum won't report any differences unless the server you're connecting to has slaves.
Also, the --replicate-check-only
option will not do any checksumming. (from the docs):
If specified, pt-table-checksum doesn’t checksum any tables. It checks replicas for differences found by previous checksumming, and then exits.
Your first command doesn't seem to be able to connect to the slave host, which is why it doesn't report any differences. Make sure the user/pass that is connecting to the master can also connect to the slave.
Now, as for your complex setup, you are right to worry about breaking replication. With some slaves replicating only certain tables, you should heed the warning here:
If the replicas are configured with any filtering options, you should be careful not to checksum any databases or tables that exist on the master and not the replicas.
You can specify which databases you want to checksum with the --databases
option, and give a specific list of tables with the --tables
option. Alternatively you can use the --ignore-databases
and --ignore-tables
options to provide a list of databases/tables to not checksum.
This will probably mean you will want separate pt-table-checksum commands based on which slaves you are trying to checksum. You will probably have to use the 'dsn' --recursion-method
to accomplish this (I've never done it, personally)
As for load, pt-table-checksum comes with some options to throttle itself. Namely --max-load
and --max-lag
.
The tool keeps track of how quickly the server is able to execute the queries, and adjusts the chunks as it learns more about the server’s performance. It uses an exponentially decaying weighted average to keep the chunk size stable, yet remain responsive if the server’s performance changes during checksumming for any reason. This means that the tool will quickly throttle itself if your server becomes heavily loaded during a traffic spike or a background task, for example.
You're incorrect. It only runs the checksums on the master, not on the slaves.
Example query run:
REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT 'db', 'table', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `field1`, `field2`, `field3`, `field4`, `field5`, `field6`, `field7`)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db`.`table` /*checksum table*/;
This works because it requires statement based replication, so the query is run fresh on every slave against their data, not the data the master found. This is documented on pt-table-checksums docs under LIMITATIONS
pt-table-checksum requires statement-based replication, and it sets
binlog_format=STATEMENT on the master, but due to a MySQL limitation
replicas do not honor this change. Therefore, checksums will not replicate
past any replicas using row-based replication that are masters for further replicas.
If you modify your data after the checksum query has replicated, of course it won't show that the data was changed, as it was correct when the checksum was calculated.
If you re-run the checksum after changing the data on the slave, then it should pick it up, assuming you're not using a n>2 tier replication setup.
re: Another hypothesis, it doesn't matter, the connection to the slave only matters for 'live' detection of mismatches, not for the checksums to actually calculate correctly. You shouldn't have false negatives, even with network interruptions.
Makes sense?
Best Answer
The issue is that the replicas don't report their ports to the master. By default pt-table-checksum tries to connect by filling unknowns with defaults; if the unknowns aren't using defaults, this will fail. In this case, there are two options:
1) Use the --recursion-method option to change from inspecting SHOW PROCESSLIST to inspecting SHOW SLAVE HOSTS, and configure slaves to report their host. I don't recommend this.
2) I recommend, instead, that you use --recursion-method=dsn and create a table full of DSN (data source name) instructions that tell the tool how to connect to each replica. If this table is foo.bar, then you will use --recursion-method=dsn=D=foo,t=bar.
This is only available in the 2.0 series of the toolkit, but you should be using that anyway, because pt-table-checksum is hugely improved in 2.0.