We run a Cassandra cluster with 2 DCs of 3 nodes each (RF 2). The state of the cluster is quite bad (it has never been repaired) and the disks were getting full, so we added additional nodes, which successfully joined the cluster after the bootstrap procedure.
According to the Cassandra documentation, you are supposed to run cleanup after adding a new node, but in our case:
Since we have never run repair (and it is now too late to run it, since disk usage on the old nodes is > 90%), is it safe for the data to run:
nodetool cleanup
on each node in order to free space?
Cheers,
Jbrl
Best Answer
Simply put, nodetool cleanup will remove data from a node that it is no longer responsible for. This can happen when adding new nodes to a cluster, as they assume token ranges once held by other nodes. Once the data is streamed to the new node(s), it isn't automatically removed from the pre-existing nodes. This is intentional, in that if a new node fails to bootstrap, no data is lost.
Running a cleanup will indeed remove potentially orphaned replicas. Normally, it is a good idea to ensure that repairs have been successfully completing prior to running nodetool cleanup.
In this case, as repair has never been run, I can't imagine that the application users really care about data consistency at this point. But one thing to try might be to run cleanup on one node to get some space back, followed by repair on the same node. Work your way through the cluster running repair/cleanup (in that order) on each. That should minimize the risk of data loss, while allowing for restoration of some semblance of consistency.
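The node-by-node procedure described above could be scripted roughly as follows. This is only a sketch under assumptions: the host names node1, node2 and node3 are hypothetical placeholders, the helper names run and rolling_maintenance are made up for illustration, and nodetool is assumed to be on the PATH and able to reach each node with -h. Because both orders of the two steps are mentioned above, the order is left as a variable (STEPS) rather than fixed. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical host names: replace with the addresses of the pre-existing nodes.
NODES="${NODES:-node1 node2 node3}"
# Order of operations on each node; set to "repair cleanup" to swap them.
STEPS="${STEPS:-cleanup repair}"
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Run every step on one node before moving on to the next node.
rolling_maintenance() {
  for host in "$@"; do
    for step in $STEPS; do
      # Stop at the first failure so a broken node is not skipped silently.
      run nodetool -h "$host" "$step" || return 1
    done
  done
}

# Unquoted on purpose so the list splits into one argument per host.
rolling_maintenance $NODES
```

With the defaults this just lists the nodetool invocations it would make, which is a cheap way to check the plan. Running it with DRY_RUN=0 executes them one node at a time, so it is worth checking free disk space between nodes before continuing.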