Apache Cassandra – How to Handle Dynamic Column Set Conflicts

cassandra

I know that each column in Apache Cassandra has a timestamp attached and that read conflicts for a single column are resolved deterministically by comparing timestamps or, when those are equal, by comparing the values.

Let's say I add a column to a dynamic column set and write it to a single node. Later, I add another column to the same column set, but this time through another node. How does Apache Cassandra merge these two? Will both columns exist after the merge?

Best Answer

The best way to think of a Cassandra cluster is not as a set of separate databases on different nodes, but as one single database. A column written through the first node is not private to that node: the write is replicated to the other nodes responsible for that row. The actual number of copies is determined by your replication strategy and replication factor, and the copies converge to the same data.

Thus, if you write a new column through the first node, the second node will automatically know about it and be able to read that data. If you then write different data to the same column through the second node, it will either overwrite the old value or add a new one, depending on whether the row you're writing into already has a value in that column.

If you're adding a new column through the second node, then that column and the first column will both exist after the merge, and both can be queried through any node.
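The merge described above can be sketched as a small Python model. This is a simplified illustration, not Cassandra's actual code: the names `Cell`, `Row`, and `merge_rows` are hypothetical, values are plain strings, and tombstones and TTLs are ignored. It assumes the rule stated in the question: per column, the higher timestamp wins, and a tie is broken by comparing the values.

```python
from typing import Dict, Tuple

# A cell is (write_timestamp, value); a row maps column name -> cell.
# These types are illustrative only, not Cassandra's internal structures.
Cell = Tuple[int, str]
Row = Dict[str, Cell]


def merge_rows(a: Row, b: Row) -> Row:
    """Merge two replicas' views of the same row, column by column."""
    merged: Row = dict(a)
    for name, cell in b.items():
        current = merged.get(name)
        # Tuple comparison: the higher timestamp wins; on an equal
        # timestamp the greater value wins, so the result is deterministic.
        if current is None or cell > current:
            merged[name] = cell
    return merged


# Hypothetical replica states: each node received a different column.
node1: Row = {"col_a": (1000, "x")}
node2: Row = {"col_b": (2000, "y")}
merged = merge_rows(node1, node2)
```

In this model, columns that exist on only one side are simply carried over, so `merged` holds both `col_a` and `col_b`. A conflict arises only when the same column name appears on both sides, and it is resolved per column rather than per row.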

Related Solutions

Cassandra – Best Practices for Row Ordering

I thought about another ColumnFamily "ArticlesByDateAndCountry" with dynamic columns.

You're on the right track with this thought, but I'd stay away from dynamic columns. Currently, there isn't a way to really manage column families with dynamic columns in CQL3. So, the only real solution is to go the route of creating it in the cassandra-cli, which is being deprecated. Sticking with CQL tables gives you a much easier path to data access, and that in itself (IMO) makes it worth it. Besides, all the latest drivers work with CQL, so you're really backing yourself into a corner by choosing a path that can't be managed by it.

While it may not be readily apparent, there is a way to solve your current problem to adequately serve your queries. I would build a (CQL) table, like this:

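CREATE TABLE ArticlesByDateAndCountry (
    countrycode text,       -- partition key: one partition per country
    articledate timestamp,  -- clustering key: orders rows within a partition
    field1 text,            -- sample payload field
    field2 text,            -- sample payload field
    PRIMARY KEY ((countrycode), articledate)
) WITH CLUSTERING ORDER BY (articledate DESC);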

Note: I have created two sample payload fields, field1 and field2. I'm sure your payload fields will vary. Also, I have opted to use a timestamp instead of a timeuuid, as it makes the example easier.

Essentially, this will group your data by countrycode. And within each countrycode, your data will be sorted by articledate.

SELECT articledate, field1, field2
FROM ArticlesByDateAndCountry 
WHERE countrycode='US' 
AND articledate >= '2015-01-23 00:00:00' AND articledate < '2015-01-24 00:00:00';
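The grouping and ordering described above can be pictured with a small in-memory Python model. This is only a sketch of the table's logical layout, not how Cassandra stores data on disk; `table`, `upsert`, and `select_range` are hypothetical names, and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# countrycode (partition key) -> {articledate (clustering key) -> payload}
table = defaultdict(dict)


def upsert(countrycode, articledate, field1, field2):
    # Writing the same (countrycode, articledate) again replaces the row,
    # mirroring the uniqueness of the primary key.
    table[countrycode][articledate] = (field1, field2)


def select_range(countrycode, start, end):
    """Rows with start <= articledate < end, newest first (DESC)."""
    partition = table.get(countrycode, {})
    dates = sorted((d for d in partition if start <= d < end), reverse=True)
    return [(d, *partition[d]) for d in dates]


# Made-up sample data.
upsert("US", datetime(2015, 1, 23, 8, 0), "a1", "b1")
upsert("US", datetime(2015, 1, 23, 17, 30), "a2", "b2")
upsert("US", datetime(2015, 1, 24, 9, 0), "a3", "b3")
upsert("DE", datetime(2015, 1, 23, 12, 0), "a4", "b4")

rows = select_range("US", datetime(2015, 1, 23), datetime(2015, 1, 24))
```

Here `rows` contains only the two US articles from 2015-01-23, with the later one first, which corresponds to the query above touching a single partition and reading a contiguous, already-ordered slice of it.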

You should read Patrick McFadin's article "Getting started with Cassandra time series data modeling". It has several examples that are quite similar to what you are doing here. While I have demonstrated this with the timestamp type, you could very easily make this work with time UUIDs instead. DataStax's documentation on Cassandra's timeuuid functions should also be useful here.
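As background on why time UUIDs can stand in for timestamps: a timeuuid is a version 1 UUID, which embeds its creation time. The following Python sketch uses only the standard `uuid` module to recover that time; the helper name `timeuuid_to_datetime` is hypothetical and is not one of Cassandra's CQL functions.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUIDs count 100-nanosecond intervals since this date.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(u: uuid.UUID) -> datetime:
    """Return the UTC creation time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    # u.time is in 100 ns units; integer-divide by 10 to get microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


article_id = uuid.uuid1()
created_at = timeuuid_to_datetime(article_id)
```

Because the time is part of the identifier, a timeuuid clustering column can order rows chronologically while still keeping two articles written at the same instant distinct.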

Modeling Graph Data in Cassandra DB

For a graph database on Cassandra, have a look at TitanDB.

What you need is already implemented in TitanDB. Implementing your own Graph Database is not trivial, and would be very time consuming. In most cases, a proven solution is best. (BTW, I am not involved in TitanDB development or business.) I have no idea about your use case, but I do not see a reason to implement something new, except as a hobby.

Update: I found a whitepaper about Titan GraphDB's data model: https://github.com/thinkaurelius/titan/wiki/Titan-Data-Model. It gives some hints on how to design a datastore for graphs.

Aurelius is now also part of DataStax, and they are working on a combined solution for storing big graphs in Cassandra.
