How does Cassandra distribute data?

cassandra

I cannot find a detailed document on this anywhere. That said, if anyone has an answer to my question, I would be extremely grateful.

With Cassandra's standard (single-token) node architecture, you as the administrator are required to divide the hash range by the number of nodes in the cluster; the results are your tokens.
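As a rough sketch of that manual calculation, assuming the Murmur3Partitioner's token range of -2^63 to 2^63 - 1 (the function name `initial_tokens` is made up for illustration):

```python
# Assumption: tokens span the Murmur3Partitioner range [-2**63, 2**63 - 1].
RING_MIN = -2**63
RING_SIZE = 2**64


def initial_tokens(node_count):
    """Return one evenly spaced token per node (hypothetical helper)."""
    return [RING_MIN + i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Each value would be assigned by hand to one node of a 4-node cluster.
    for node, token in enumerate(initial_tokens(4)):
        print(f"node {node}: {token}")
```

With a different partitioner the range, and therefore the arithmetic, would differ.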

Vnode tokens, on the other hand, are generated automagically, and I am curious how the system maintains these tokens architecturally.

To further specify my question: for virtual nodes, does the cluster re-evaluate tokens for all nodes just because a new host has been added?

I am inclined to believe that it doesn't, as that would be a very expensive operation: any data that ended up on the wrong node after the recalculation would have to be moved. Unfortunately, I cannot find out what actually happens…

Also, if two nodes can hold the same token, doesn't that break the architecture, or at the very least make designating a replication factor almost pointless, since data would have to be written to all nodes that hold the specified vtoken/token?

How is a vnode chosen for a particular token? Are the tokens assigned serially or randomly?

Lots of difficult questions, but I am counting on any Cassandra experts hanging around.

Thank you,

Scott

Best Answer

I initially just wanted to point at this as a comment, but my rep is too low. So excuse me if this is not a fully fledged answer. (Anyone feel free to edit or hint in comments)

Looking at the DataStax documentation, it appears that data will be evenly divided among the nodes, depending on the "automagical" tokens. From my understanding it's as simple as this: a new node with new tokens just takes an even portion from each existing node, like all the other nodes already do. Hell, this is basically what the doc says, word for word:

Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.

And to address your question:

To further specify my question: for virtual nodes, does the cluster re-evaluate tokens for all nodes just because a new host has been added?

Sort of. The existing nodes keep their tokens; the new node itself takes some partitions (roughly evenly) from all other nodes. If you remove a node, its partitions are picked up by the remaining nodes, which make up for the lost node.
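To make that a bit more concrete, here is a toy simulation, not Cassandra's actual code. It assumes each node simply picks its vnode tokens at random from a Murmur3-style range and ignores replication entirely; the names `join`, `leave` and `ownership` are made up for this sketch.

```python
import random

RING_MIN = -2**63   # assumed Murmur3-style token range
RING_SIZE = 2**64


def join(ring, node, rng, vnodes=256):
    """Add `vnodes` random tokens for `node`; existing tokens are never moved."""
    for _ in range(vnodes):
        token = rng.randrange(RING_MIN, RING_MIN + RING_SIZE)
        ring.setdefault(token, node)  # keep the current owner on a collision


def leave(ring, node):
    """Drop all tokens of `node`; each freed range falls to the next token."""
    for token in [t for t, owner in ring.items() if owner == node]:
        del ring[token]


def ownership(ring):
    """Fraction of the ring owned per node.

    A token owns the range from the previous token (exclusive) up to
    itself (inclusive), wrapping around at the ends of the ring.
    """
    tokens = sorted(ring)
    shares = {}
    for i, token in enumerate(tokens):
        prev = tokens[i - 1]  # i == 0 wraps around to the largest token
        width = (token - prev) % RING_SIZE or RING_SIZE
        shares[ring[token]] = shares.get(ring[token], 0.0) + width / RING_SIZE
    return shares


if __name__ == "__main__":
    rng = random.Random(42)
    ring = {}
    for name in ("node1", "node2", "node3", "node4"):
        join(ring, name, rng)
    print("before join:", ownership(ring))
    join(ring, "node5", rng)
    print("after join: ", ownership(ring))
    leave(ring, "node1")
    print("after leave:", ownership(ring))
```

In this model a joining node only inserts new tokens, so every existing node gives up a slice of its ranges and nothing else moves. A leaving node's ranges are absorbed by whichever tokens follow them on the ring, which spreads the load over the remaining nodes.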