Allocate space in Cassandra

cassandra scylladb

We run a Cassandra cluster with 2 DCs of 3 nodes each (RF 2). The cluster is in poor shape (it has never been repaired), and the disks were filling up, so we added nodes, which joined the cluster successfully after bootstrapping.
According to the Cassandra documentation, cleanup is supposed to be run after adding a new node. In our case, however, we have never run repair, and it is now too late to run one because disk usage on the old nodes is above 90%. Is it safe for the data to run:

nodetool cleanup

on each node in order to free space?

Cheers,

Jbrl

Best Answer

Simply put, nodetool cleanup will remove data from a node that it is no longer responsible for. This can happen when adding new nodes to a cluster, as they assume token ranges once held by other nodes. Once the data is streamed to the new node(s), it isn't automatically removed from the pre-existing nodes. This is intentional, in that if a new node fails to bootstrap no data is lost.
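To see how much space those leftover replicas occupy on a single pre-existing node, something like the following could be run locally on that node. It is only a sketch: the `DATA_DIR` default is an assumed package-install path (check `data_file_directories` in cassandra.yaml), and the function names are made up for illustration. Keep in mind that cleanup rewrites SSTables, so it needs some free space of its own while it runs.

```shell
# Assumed data directory; adjust to match data_file_directories in cassandra.yaml.
DATA_DIR="${DATA_DIR:-/var/lib/cassandra/data}"

# Print the size of the data directory in kilobytes.
data_kb() {
    du -sk "$DATA_DIR" | awk '{print $1}'
}

# Run cleanup on the local node and report how much space was reclaimed.
# (cleanup_and_report is a hypothetical helper name, not a nodetool feature.)
cleanup_and_report() {
    before=$(data_kb)
    nodetool cleanup || return 1
    after=$(data_kb)
    echo "reclaimed_kb=$((before - after))"
}
```

After sourcing the snippet, calling `cleanup_and_report` on one old node gives a rough idea of what to expect from the others.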

Running cleanup will indeed remove potentially orphaned replicas. Normally, it is a good idea to make sure that repairs have been completing successfully before running nodetool cleanup.

In this case, since repair has never been run, I can't imagine that the application users really care about data consistency at this point. But one thing to try might be to run cleanup on one node to get some space back, followed by repair on the same node. Work your way through the cluster, running cleanup and then repair (in that order) on each node. That should minimize the risk of data loss while restoring some semblance of consistency.
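The node-by-node pass could be sketched roughly like this, with cleanup first to free space and repair second on the same node. The addresses in `OLD_NODES` and the function names are placeholders, and `nodetool -h` assumes JMX is reachable from wherever the script runs; if it is not, the same two commands can be run locally on each node in turn. Repair options such as `-pr` or `-full` are deliberately left out, since the right choice depends on the Cassandra version and on how the cluster is normally repaired.

```shell
# Hypothetical addresses of the pre-existing (nearly full) nodes; replace with your own.
OLD_NODES="${OLD_NODES:-10.0.0.1 10.0.0.2 10.0.0.3}"

# Cleanup first to free space, then repair the same node.
cleanup_then_repair() {
    target="$1"
    echo "[$target] cleanup"
    nodetool -h "$target" cleanup || return 1
    echo "[$target] repair"
    nodetool -h "$target" repair || return 1
}

# Visit the nodes one at a time and stop at the first failure.
rolling_cleanup_repair() {
    for node in $OLD_NODES; do
        if ! cleanup_then_repair "$node"; then
            echo "stopping: $node failed" >&2
            return 1
        fi
    done
}
```

Calling `rolling_cleanup_repair` starts the pass. Stopping at the first failure is a deliberate choice here: with disks this full, it seems safer to look at a failed node by hand than to keep going.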