Cassandra nodetool repair scheduling and gc_grace_seconds

cassandra

I am trying to devise a repair schedule with minimal repair frequency so as not to overload the system, but I am struggling to understand how I can ensure that deleted data does not get resurrected.

I have a Cassandra schema with gc_grace_seconds set to 48 hours (172800 seconds).

I know it is recommended that a repair [must be run before gc_grace_seconds expires to ensure deleted data is not resurrected][1].

I run nodetool repair with the `-pr` option so that it takes less time to complete (about 1.5 hours per node with `-pr` and around 6-7 hours per node without it).
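Since `-pr` repairs only each node's primary ranges, a full cycle means running it on every node in turn. A rough back-of-the-envelope sketch of the cycle length, using the per-node timings above (the 6-node ring size is purely a hypothetical assumption):

```python
# Rough estimate of how long one full repair cycle takes, using the
# per-node timings quoted above. The node count is hypothetical.
PR_HOURS_PER_NODE = 1.5      # nodetool repair -pr, per node (from the question)
FULL_HOURS_PER_NODE = 6.5    # midpoint of the 6-7 h full-repair estimate

def cycle_hours(nodes: int, hours_per_node: float) -> float:
    """Repairs are typically run one node at a time, so the cycle is sequential."""
    return nodes * hours_per_node

# For an assumed 6-node ring:
print(cycle_hours(6, PR_HOURS_PER_NODE))    # 9.0 hours with -pr
print(cycle_hours(6, FULL_HOURS_PER_NODE))  # 39.0 hours without
```

The point of the estimate is that the whole rolling cycle, not a single node's run, is what has to fit inside gc_grace_seconds.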

Now, I don't see how I can guarantee that the data does not become visible again after the tombstone expires.

In my understanding, if we create a tombstone on 1/1 at 00:00, it will become eligible for removal on 3/1 at 00:00.

If the repair job is scheduled to start every 2 days at the same time, the repair will start at 3/1 00:00, and until it has taken care of the expired tombstone there is a period during which the data can be resurrected.

If the repair job is scheduled to run, say, every day at the same time, we still have the same problem: there will be a delay between the tombstone expiring and the nodetool repair run that starts at 00:00 every day.

If we make gc_grace_seconds more than 48 hours, say 60 hours, then in my understanding the "resurrection window" only gets longer: the tombstone will expire at 3/1 12:00 and "wait" until 4/1 00:00, even if we run the repair every day.
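The usual rule of thumb (as I understand it, and consistent with the answer below) is that the repair *interval* plus the repair *duration* must fit inside gc_grace_seconds, so every tombstone is propagated to all replicas before it becomes purgeable. A minimal sketch of that arithmetic, with a hypothetical repair duration:

```python
# Sketch of the scheduling condition: a tombstone written at time t is
# guaranteed to have been repaired (propagated to all replicas) no later
# than t + repair_interval + repair_duration. It only becomes purgeable
# at t + gc_grace_seconds, so safety requires:
#   repair_interval + repair_duration <= gc_grace_seconds

def schedule_is_safe(gc_grace_s: int, repair_interval_s: int,
                     repair_duration_s: int) -> bool:
    return repair_interval_s + repair_duration_s <= gc_grace_s

HOUR = 3600
# gc_grace_seconds = 48 h, daily repair taking ~9 h (assumed duration):
print(schedule_is_safe(48 * HOUR, 24 * HOUR, 9 * HOUR))   # True
# Repairing only every 2 days leaves no slack for the repair itself:
print(schedule_is_safe(48 * HOUR, 48 * HOUR, 9 * HOUR))   # False
```

Under this framing, the worry about the gap *after* expiry dissolves: what matters is that the tombstone was already synced to all replicas *before* it expired, not that a repair runs immediately after.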

So how will nodetool take care of the tombstones so that there is no period during which the data comes back to life?

Best Answer

Found an answer to my own question on SO. There seem to be many more Cassandra-related questions there than on DBA.SE:


https://stackoverflow.com/questions/32340429/what-does-cassandra-nodetool-repair-exactly-do

> The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
>
> When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
>
> The reason you need to run repair within gc grace seconds is so that tombstones will be sync'd to all nodes. If a node is missing a tombstone, then it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.
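That last paragraph is the whole failure mode in a nutshell. A toy simulation of it, with purely illustrative node names and data structures (nothing here is a Cassandra API):

```python
# Toy model of the resurrection scenario: three replicas hold a row,
# the delete lands on only two of them, compaction purges the expired
# tombstone, and a later repair streams the stale row back.

TOMBSTONE = object()

replicas = {
    "node1": {"k": TOMBSTONE},    # saw the delete
    "node2": {"k": TOMBSTONE},    # saw the delete
    "node3": {"k": "old-value"},  # was down and missed the delete
}

def compact(replica):
    """After gc_grace_seconds, compaction drops expired tombstones."""
    for key in [k for k, v in replica.items() if v is TOMBSTONE]:
        del replica[key]

def repair(replicas):
    """Repair makes replicas agree. With the tombstone purged, the only
    surviving version of the row is the stale one, so it wins."""
    for key in {k for r in replicas.values() for k in r}:
        versions = [r[key] for r in replicas.values() if key in r]
        winner = versions[0]
        for r in replicas.values():
            r[key] = winner

compact(replicas["node1"])
compact(replicas["node2"])
repair(replicas)
print(replicas["node1"])  # {'k': 'old-value'} -- the delete is undone
```

Had the repair run *before* the tombstone expired, node3 would have received the tombstone instead, and compaction would later have dropped the row everywhere.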


So it is sufficient that the tombstone is delivered to the other replicas before it expires on the original node, and that happens whenever a repair completes on this node (as I understand it). And because repair takes snapshots of the data before validating it, I think there is no chance that the tombstone expires in the middle of the repair and gets lost.

It would be nice to test it though.