Schema design in Cassandra will grate against your RDBMS experience: for efficient tables, Cassandra prefers denormalization over normalization. By this I mean that if you have some user information and you want to look it up by two different primary keys, then in Cassandra it is actually better to use two tables (and duplicate the data). Yes, this takes more storage space, but it also allows for faster reads.
As a side note, based on my own experience, I would recommend against using a secondary index, and instead simply use another table. Secondary indexes in Cassandra work a little differently from RDBMS indexes: each node indexes only its own local data, so a query that filters only on the indexed column may have to contact many nodes. This makes reading from an index slower and less predictable (i.e. more likely to surprise you, in a bad way) than just using a table.
Thus I would recommend something like the following two tables for your needs:
CREATE TABLE users (
    id TIMEUUID PRIMARY KEY,
    user UUID,
    friend UUID
);
CREATE TABLE friends (
    id TIMEUUID,
    user UUID,
    friend UUID,
    PRIMARY KEY (user, friend)
);
This second table would let you do your CQL query:
SELECT * FROM friends WHERE user = ?
Notice that this friends table uses a compound primary key. This allows there to be multiple friend values associated with that single user value.
One downside of this multiple-table approach is that your application code is now responsible for writing to both tables for a single "update", and you have to deal with any potential skew/reconciliation between them. Cassandra achieves its performance, in many ways, by not enforcing foreign key constraints and the like, leaving that to the application.
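One way to keep the two writes together is a logged batch, which Cassandra will eventually apply in full or not at all (at some extra cost, since the batch spans partitions). The snippet below is only a sketch: build_dual_write is a hypothetical helper, not part of any driver. It just builds the CQL text and the bound values for the users and friends tables above, and how you prepare and execute it depends on the driver you use.

```python
from uuid import uuid1, uuid4


def build_dual_write(row_id, user, friend):
    """Return (cql, params) for one logged batch that inserts into both tables.

    Hypothetical helper for illustration; it only builds text and values.
    """
    insert = "INSERT INTO {table} (id, user, friend) VALUES (?, ?, ?);"
    cql = "\n".join([
        "BEGIN BATCH",
        "  " + insert.format(table="users"),
        "  " + insert.format(table="friends"),
        "APPLY BATCH;",
    ])
    # The same three values are bound twice, once per INSERT.
    params = (row_id, user, friend) * 2
    return cql, params


if __name__ == "__main__":
    # uuid1() is a time-based (version 1) UUID, which is what TIMEUUID stores.
    cql, params = build_dual_write(uuid1(), uuid4(), uuid4())
    print(cql)
    print(len(params), "bound values")
```

A batch reduces the chance of one table being written without the other, but it does not remove the need to think about reconciliation, for example when a row is deleted or its key columns change.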
Hope this helps!
Best Answer
The short answer: you have to take snapshots on all nodes.
As you pointed out, Cassandra is a distributed database. As an example, suppose you have 3 nodes with a replication factor (RF) of 2. Each node has primary responsibility for 1/3 of all the tokens in the ring. In addition, each node holds a replica of another node's range, so nodetool status will show each node owning about 66.7% (2 replicas / 3 nodes).
If you only back up one node, you only get the data on that node plus whatever replicas are stored on that node. Since the data is distributed, you will end up missing some data unless you take a snapshot on all nodes.
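The arithmetic in this example can be checked with a toy model of the ring. This is a sketch under simplifying assumptions that are mine, not Cassandra's actual placement code: one equal-sized token range per node, and each range replicated to the next RF nodes clockwise. The node names and function names are made up for illustration.

```python
from fractions import Fraction


def replica_map(nodes, rf):
    """Map each primary range index to the nodes that hold a copy of it.

    Assumption: range i belongs to nodes[i] and is also copied to the
    next rf - 1 nodes clockwise around the ring.
    """
    n = len(nodes)
    return {i: [nodes[(i + k) % n] for k in range(rf)] for i in range(n)}


def owned_fraction(nodes, rf, node):
    """Fraction of all ranges for which `node` holds a copy."""
    ranges = replica_map(nodes, rf)
    held = sum(1 for holders in ranges.values() if node in holders)
    return Fraction(held, len(ranges))


def missing_ranges(nodes, rf, backed_up):
    """Ranges that have no copy on any of the backed-up nodes."""
    ranges = replica_map(nodes, rf)
    return [r for r, holders in ranges.items()
            if not set(holders) & set(backed_up)]


if __name__ == "__main__":
    nodes = ["A", "B", "C"]
    print(f"A owns {float(owned_fraction(nodes, 2, 'A')):.1%}")
    print("missing if only A is backed up:", missing_ranges(nodes, 2, ["A"]))
    print("missing if all are backed up:", missing_ranges(nodes, 2, nodes))
```

In this model a snapshot of node A alone leaves one of the three ranges with no copy at all, while snapshots of every node cover every range.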