Cassandra One-to-Many Table Design Best Practices

cassandra

I am new to Cassandra from 20 years of RDBMS, and over the last few days have been reading and watching everything I can get my hands on that applies to my situation. I'm sure this is a basic question, but for some reason it's not clicking so please excuse me if this is answered elsewhere.

I am trying to design a table to store a list of associations between users. Any user may have any number of associated users. It's sort of like a friends list.

Right now I have 3 fields:

id (timeuuid) (PK)
user (uuid)
friend (uuid)

I want to be able to do:

SELECT * FROM friends WHERE user = ?

Because user is not part of the PK, it cannot be used in a WHERE clause. And if I make it the PK, it has to be unique, meaning each user can have at most 1 friend.

I sort of solved this by using a secondary index on user, but I imagine that approach is not the best idea. The query is rather slow to respond, obviously because it has to ask all my cluster nodes for their part of the data.

So: what is the proper way to design this table? Thank you very much in advance for any guidance.

Best Answer

Schema design in Cassandra will grate against your RDBMS experience: for efficient tables, Cassandra prefers denormalization over normalization. By this I mean that if you have some user information and want to look it up by two different keys, then in Cassandra it is actually better to use two tables (and duplicate the data). Yes, this means more storage space, but it also allows for faster reads.

As a side note, based on my own experience, I would recommend against using a secondary index and instead simply use another table. Secondary indexes in Cassandra behave differently from RDBMS indexes: each node indexes only its own data, so a query on the indexed column usually has to be sent to many nodes instead of a single partition. That makes reading from an index slower and less predictable (i.e. more likely to surprise you, in a not-good way) than just reading from a table designed for the query.
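For reference, the index-based setup described in the question would look roughly like the sketch below. The index name friends_user_idx is made up for illustration; the table and columns are the ones from the question.

```cql
-- The table as described in the question: id alone is the primary key.
CREATE TABLE friends (
  id TIMEUUID PRIMARY KEY,
  user UUID,
  friend UUID
);

-- Hypothetical index name; this is what makes "WHERE user = ?" legal.
CREATE INDEX friends_user_idx ON friends (user);

-- Allowed now, but no partition key is supplied, so the query
-- cannot be routed to the single partition that holds the answer.
SELECT * FROM friends WHERE user = ?;
```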

Thus I would recommend something like the following two tables for your needs:

CREATE TABLE users (
  id TIMEUUID PRIMARY KEY,
  user UUID,
  friend UUID
);

CREATE TABLE friends (
  id TIMEUUID,
  user UUID,
  friend UUID,
  PRIMARY KEY (user, friend)
);

This second table would let you do your CQL query:

SELECT * FROM friends WHERE user = ?

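Notice that this friends table uses a compound primary key: user is the partition key and friend is a clustering column. Only the combination (user, friend) has to be unique, so a single user value can have many friend values, and all of a user's friends are stored together in one partition.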
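As a small illustration of that point, the following sketch inserts two friends for one user and reads them back. The UUID values are placeholders made up for the example, and now() is used only to fill the id column.

```cql
-- Two rows with the same user but different friend values:
-- the (user, friend) pairs differ, so both rows are kept.
INSERT INTO friends (user, friend, id)
VALUES (11111111-1111-1111-1111-111111111111,
        22222222-2222-2222-2222-222222222222, now());

INSERT INTO friends (user, friend, id)
VALUES (11111111-1111-1111-1111-111111111111,
        33333333-3333-3333-3333-333333333333, now());

-- Returns both rows.
SELECT friend FROM friends
WHERE user = 11111111-1111-1111-1111-111111111111;
```

Note that inserting an existing (user, friend) pair again overwrites that row rather than creating a duplicate, since Cassandra inserts are upserts.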

One of the downsides of this multiple-table approach is that your application code is now responsible for writing into both tables for a single "update", and you have to deal with any potential skew/reconciliation between them. Cassandra achieves its performance, in many ways, by not enforcing foreign key constraints and the like, and leaving that to the application.
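One way to reduce the risk of the two tables drifting apart is to send both writes as a single logged batch. This is a sketch under assumptions rather than a requirement of the design above: the named bind markers :id, :user and :friend are placeholders the application would bind, with the timeuuid generated once on the client so both rows carry the same id.

```cql
-- Logged batch: if any statement in the batch is applied,
-- all of them eventually are. It gives no isolation across
-- partitions, and it costs more than two plain writes.
BEGIN BATCH
  INSERT INTO users (id, user, friend) VALUES (:id, :user, :friend);
  INSERT INTO friends (user, friend, id) VALUES (:user, :friend, :id);
APPLY BATCH;
```

Whether the extra cost is worth it depends on how much inconsistency between the two tables your application can tolerate.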

Hope this helps!