Cassandra Backup – From All Nodes or Just One?

cassandra

Cassandra is a distributed database, where each node is in sync with the other nodes from the same ring/cluster.

When taking backups based on a snapshot, do I need to back up each node individually or is one enough?

The docs say:

To take a global snapshot, run the nodetool snapshot command using a
parallel ssh utility, such as pssh.

Am I missing a point here?

Best Answer

The short answer: you have to take snapshots on all nodes.

As you pointed out, Cassandra is a distributed database. As an example, suppose you have 3 nodes with a replication factor (RF) of 2. Each node has primary responsibility for 1/3 of all the tokens in the ring. In addition, each node stores a replica of another node's range, so nodetool status will show each node owning about 66.7% (2 replicas / 3 nodes).
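The ownership arithmetic above can be checked with a small sketch. It assumes an idealized ring: equal-sized primary ranges and SimpleStrategy-style placement, where each range is stored on its primary node and the next RF - 1 nodes. The function name `replica_ownership` is made up for illustration; it is not part of Cassandra or nodetool.

```python
from fractions import Fraction


def replica_ownership(num_nodes, rf):
    """Return the fraction of the ring stored on each node.

    Assumption: equal primary ranges, and each range is replicated
    on its primary node plus the next rf - 1 nodes around the ring.
    """
    owned = {node: Fraction(0) for node in range(num_nodes)}
    for primary in range(num_nodes):
        for offset in range(rf):
            # Each primary range is 1/num_nodes of the ring.
            owned[(primary + offset) % num_nodes] += Fraction(1, num_nodes)
    return owned


ownership = replica_ownership(num_nodes=3, rf=2)
for node, share in ownership.items():
    print(f"node {node}: stores {share} of the ring")
```

With 3 nodes and RF 2, every node ends up with 2/3 of the ring, which matches the 2 replicas / 3 nodes figure.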

If you only back up one node, you only get that node's primary data plus whatever replicas are stored on it. Since the data is distributed, you will end up missing some data unless you take a snapshot on all nodes.
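To see which data a single-node backup misses, here is a minimal sketch using the same idealized model (equal primary ranges, each range stored on its primary node and the next RF - 1 nodes). The helper names `ranges_on_node` and `missing_ranges` are hypothetical, and ranges are simply numbered after their primary node.

```python
def ranges_on_node(node, num_nodes, rf):
    """Ranges stored on a node: its own plus those of the rf - 1 preceding nodes."""
    return {(node - offset) % num_nodes for offset in range(rf)}


def missing_ranges(backed_up_nodes, num_nodes, rf):
    """Ranges that appear in none of the backed-up nodes' snapshots."""
    covered = set()
    for node in backed_up_nodes:
        covered |= ranges_on_node(node, num_nodes, rf)
    return set(range(num_nodes)) - covered


one_node = missing_ranges([0], num_nodes=3, rf=2)
all_nodes = missing_ranges([0, 1, 2], num_nodes=3, rf=2)
print("missing after backing up node 0 only:", one_node)
print("missing after backing up every node:", all_nodes)
```

In this toy ring, node 0 holds ranges 0 and 2, so range 1 is absent from its snapshot. Real clusters with virtual nodes spread ranges far less predictably, which is one more reason to snapshot every node.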

Related Solutions

How to predict new node allocation in a Cassandra cluster

Be careful about decreasing the amount of free disk space. The recommendation for Cassandra 2.1 is to maintain 50-80% free space. You can read about it on the DataStax site. The free space is needed during compaction, because new SSTables are written to disk before the old ones are removed. How much headroom you need depends on your compaction strategy.

Cassandra One-to-Many Table Design Best Practices

Efficient schema design in Cassandra will grate against your RDBMS experience: Cassandra prefers denormalization, not normalization. By this, I mean that if you have some user information and you want to look up that data using two different primary keys, then in Cassandra it is actually better to use two tables (and duplicate the data). Yes, this means more storage space, but it also allows for faster reads.

As a side note, based on my own experience, I would recommend against using a secondary index, and instead simply use another table. Secondary indexes in Cassandra are treated a little differently, with background threads which update the indexes periodically; this makes reading from an index less reliable (i.e. more likely to surprise you, and not in a good way) than just using a table.

Thus I would recommend something like the following two tables for your needs:

CREATE TABLE users (
  id TIMEUUID PRIMARY KEY,
  user UUID,
  friend UUID
);

CREATE TABLE friends (
  id TIMEUUID,
  user UUID,
  friend UUID,
  PRIMARY KEY (user, friend)
);

This second table would let you do your CQL query:

SELECT * FROM friends WHERE user = ?

Notice that this friends table uses a compound primary key: user is the partition key and friend is a clustering column. This allows multiple friend values to be associated with a single user value.

One of the downsides of this multiple-table approach is that your application code is now responsible for writing into both tables for a single "update", and you have to deal with any potential skew/reconciliation. Cassandra achieves its performance, in many ways, by not enforcing foreign key constraints and the like, and leaving that to the application.
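As a rough sketch of what "writing into both tables" can look like in application code, the snippet below sends both inserts as one logged CQL batch. It assumes a `session` object with an `execute(query, params)` method and `%s` placeholders, in the style of the DataStax Python driver's `Session`; the `add_friend` helper and the `RecordingSession` stand-in are made up for illustration, and the stand-in only records calls so the example runs without a cluster.

```python
import uuid

# One logged batch that writes the same row to both tables.
INSERT_BOTH = """
BEGIN BATCH
  INSERT INTO users (id, user, friend) VALUES (%s, %s, %s);
  INSERT INTO friends (id, user, friend) VALUES (%s, %s, %s);
APPLY BATCH
"""


def add_friend(session, user, friend):
    """Write one friendship to both tables and return the generated id."""
    row_id = uuid.uuid1()  # version 1 UUID, suitable for a TIMEUUID column
    session.execute(INSERT_BOTH, (row_id, user, friend, row_id, user, friend))
    return row_id


class RecordingSession:
    """Stand-in for a real driver session: it only records what it is asked to run."""

    def __init__(self):
        self.calls = []

    def execute(self, query, params=None):
        self.calls.append((query, params))


demo_session = RecordingSession()
demo_id = add_friend(demo_session, uuid.uuid4(), uuid.uuid4())
print("statements sent:", len(demo_session.calls))
```

A logged batch keeps the two tables from drifting apart when one write fails, at the cost of extra coordination; whether that trade-off is worth it depends on your workload.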

Hope this helps!
