Cassandra nodetool repair scheduling and gc_grace_seconds

cassandra

I am trying to devise a repair schedule with minimal repair frequency so as not to overload the system, but I am struggling to understand how I can ensure that deleted data does not get resurrected.

I have a Cassandra schema with gc_grace_seconds set to 48 hours (172800 seconds).

I know it is recommended that a repair [must be run before gc_grace_seconds expires to ensure deleted data is not resurrected][1].

I run nodetool repair with the `-pr` option so that it takes less time to complete (about 1.5 hours per node with `-pr` and around 6-7 hours per node without it).
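Since `-pr` repairs only each node's primary ranges, a full cycle means running it on every node in turn. A rough back-of-the-envelope sketch of the cycle length, using the per-node timings above (the 6-node ring size is purely a hypothetical assumption):

```python
# Rough estimate of how long one full repair cycle takes, using the
# per-node timings quoted above. The node count is hypothetical.
PR_HOURS_PER_NODE = 1.5      # nodetool repair -pr, per node (from the question)
FULL_HOURS_PER_NODE = 6.5    # midpoint of the 6-7 h full-repair estimate

def cycle_hours(nodes: int, hours_per_node: float) -> float:
    """Repairs are typically run one node at a time, so the cycle is sequential."""
    return nodes * hours_per_node

# For an assumed 6-node ring:
print(cycle_hours(6, PR_HOURS_PER_NODE))    # 9.0 hours with -pr
print(cycle_hours(6, FULL_HOURS_PER_NODE))  # 39.0 hours without
```

The point of the estimate is that the whole rolling cycle, not a single node's run, is what has to fit inside gc_grace_seconds.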

Now, I don't see how I can guarantee that the data does not become visible again after the tombstone expires.

In my understanding, if we create a tombstone on 1/1 at 00:00, it will become eligible for removal on 3/1 at 00:00.

If the repair job is scheduled to start every 2 days at the same time, the repair will start at 3/1 00:00, and until it has taken care of the expired tombstone there is a period during which the data can be resurrected.

If the repair job is scheduled to run, say, every day at the same time, we still have the same problem: there will be a delay between the tombstone expiring and the nodetool repair run that starts at 00:00 every day.

If we make gc_grace_seconds more than 48 hours, say 60 hours, then in my understanding the "resurrection window" only gets longer: the tombstone will expire at 3/1 12:00 and "wait" until 4/1 00:00, even if we run the repair every day.
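The usual rule of thumb (as I understand it, and consistent with the answer below) is that the repair *interval* plus the repair *duration* must fit inside gc_grace_seconds, so every tombstone is propagated to all replicas before it becomes purgeable. A minimal sketch of that arithmetic, with a hypothetical repair duration:

```python
# Sketch of the scheduling condition: a tombstone written at time t is
# guaranteed to have been repaired (propagated to all replicas) no later
# than t + repair_interval + repair_duration. It only becomes purgeable
# at t + gc_grace_seconds, so safety requires:
#   repair_interval + repair_duration <= gc_grace_seconds

def schedule_is_safe(gc_grace_s: int, repair_interval_s: int,
                     repair_duration_s: int) -> bool:
    return repair_interval_s + repair_duration_s <= gc_grace_s

HOUR = 3600
# gc_grace_seconds = 48 h, daily repair taking ~9 h (assumed duration):
print(schedule_is_safe(48 * HOUR, 24 * HOUR, 9 * HOUR))   # True
# Repairing only every 2 days leaves no slack for the repair itself:
print(schedule_is_safe(48 * HOUR, 48 * HOUR, 9 * HOUR))   # False
```

Under this framing, the worry about the gap *after* expiry dissolves: what matters is that the tombstone was already synced to all replicas *before* it expired, not that a repair runs immediately after.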

So how will nodetool take care of the tombstones so that there is no period during which the data comes back to life?

Best Answer

Found an answer to my own question on SO. There seem to be many more Cassandra-related questions there than on DBA.SE:


https://stackoverflow.com/questions/32340429/what-does-cassandra-nodetool-repair-exactly-do

> The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
>
> When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
>
> The reason you need to run repair within gc grace seconds is so that tombstones will be sync'd to all nodes. If a node is missing a tombstone, then it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.
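That last paragraph is the whole failure mode in a nutshell. A toy simulation of it, with purely illustrative node names and data structures (nothing here is a Cassandra API):

```python
# Toy model of the resurrection scenario: three replicas hold a row,
# the delete lands on only two of them, compaction purges the expired
# tombstone, and a later repair streams the stale row back.

TOMBSTONE = object()

replicas = {
    "node1": {"k": TOMBSTONE},    # saw the delete
    "node2": {"k": TOMBSTONE},    # saw the delete
    "node3": {"k": "old-value"},  # was down and missed the delete
}

def compact(replica):
    """After gc_grace_seconds, compaction drops expired tombstones."""
    for key in [k for k, v in replica.items() if v is TOMBSTONE]:
        del replica[key]

def repair(replicas):
    """Repair makes replicas agree. With the tombstone purged, the only
    surviving version of the row is the stale one, so it wins."""
    for key in {k for r in replicas.values() for k in r}:
        versions = [r[key] for r in replicas.values() if key in r]
        winner = versions[0]
        for r in replicas.values():
            r[key] = winner

compact(replicas["node1"])
compact(replicas["node2"])
repair(replicas)
print(replicas["node1"])  # {'k': 'old-value'} -- the delete is undone
```

Had the repair run *before* the tombstone expired, node3 would have received the tombstone instead, and compaction would later have dropped the row everywhere.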


So it is sufficient that the tombstone is delivered to the other replicas before it expires on the original node, and that happens whenever a repair completes on this node (as I understand it). And because repair takes snapshots of the data before validating it, I think there is no chance that the tombstone expires in the middle of the repair and gets lost.

It would be nice to test it though.