Cassandra in multiple data centers – EC2MultiRegionSnitch

cassandra

I was reading this article: http://www.datastax.com/docs/1.0/cluster_architecture/replication#networktopologystrategy and I have a question about EC2MultiRegionSnitch, which I did not understand well. Going by the article: if an application requests data from a node that does not have it, but the data is on a node in another data center, will the application still be able to get it?

Best Answer

Let's first understand how data is distributed across multiple data centers. If you have two data centers, you basically have the complete data set in each of them. And if you have set a replication factor of, say, 2 for each data center, then each data center holds 2 copies of the data. This is where token assignment comes into the picture: it is an important factor in making sure no node is overburdened. See this image from the DataStax website:

[Image: token ring diagram from the DataStax website]

You see that T0 and T3 are fatter than the rest. If you look at the whole cluster, it seems that the data must be evenly distributed, but that's not true. Each data center can be seen as a virtual ring within the ring. :) Seen that way, you realize that T0 is carrying the range (T2, T0] and T3 is loaded with (T5, T3]! Still, because all the data is within each data center, on any sunny day you would not need to fetch from the other data center unless your consistency level is ALL or EACH_QUORUM (or any CL that needs more replicas than the local data center holds).
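To make the "ring within the ring" idea concrete, here is a minimal Python sketch. It assumes a toy ring of 60 tokens with T0 to T5 spaced 10 apart; the ring size, the names DC1 and DC2, and the helper per_dc_range_sizes are all made up for illustration, and racks and additional replicas are ignored. For each node it measures the range from the previous node in the same data center up to the node's own token.

```python
from collections import defaultdict

RING_SIZE = 60  # toy value for illustration; a real ring is far larger


def per_dc_range_sizes(nodes, ring_size=RING_SIZE):
    """Return {node: size of the range (previous token in same DC, own token]}.

    nodes maps a node name to a (token, data_center) pair.
    """
    by_dc = defaultdict(list)
    for name, (token, dc) in nodes.items():
        by_dc[dc].append((token, name))

    sizes = {}
    for members in by_dc.values():
        members.sort()
        for i, (token, name) in enumerate(members):
            # Index -1 wraps around to the last node of the same data center.
            prev_token = members[i - 1][0]
            size = (token - prev_token) % ring_size
            # A data center with a single node owns the whole ring.
            sizes[name] = size or ring_size
    return sizes


# Layout assumed from the answer: T0-T2 in one data center, T3-T5 in the other.
contiguous = {
    "T0": (0, "DC1"), "T1": (10, "DC1"), "T2": (20, "DC1"),
    "T3": (30, "DC2"), "T4": (40, "DC2"), "T5": (50, "DC2"),
}

# Same tokens, but the data centers alternate around the ring.
alternating = {
    "T0": (0, "DC1"), "T1": (10, "DC2"), "T2": (20, "DC1"),
    "T3": (30, "DC2"), "T4": (40, "DC1"), "T5": (50, "DC2"),
}

print(per_dc_range_sizes(contiguous))
print(per_dc_range_sizes(alternating))
```

With the contiguous layout, T0 and T3 each come out at 40 of the 60 tokens while the other nodes get 10. With the alternating layout every node gets 20, which is why token assignment matters so much here.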

So, to answer your question: in a multiple data center setup, you will always have the data in each data center (assuming a proper replication factor is set up). If your consistency level requires reading from or writing to the other data center in order to be satisfied, the coordinating node (the one your client is connected to) will talk to a node in the other data center. In EC2, each region is treated as a data center and each availability zone as a rack.
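As a rough illustration of when the coordinator has to reach into the other data center, here is a small Python sketch. It only counts the replica responses needed to satisfy a consistency level, using the usual quorum formula (replication factor divided by 2, rounded down, plus 1). The function names and the data center names "us-east" and "us-west" are assumptions made for the example; this is not Cassandra's actual code.

```python
def required_responses(level, rf_by_dc, local_dc):
    """Number of replica responses needed to satisfy a consistency level."""
    total_rf = sum(rf_by_dc.values())
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "LOCAL_QUORUM":
        return rf_by_dc[local_dc] // 2 + 1
    if level == "QUORUM":
        return total_rf // 2 + 1
    if level == "EACH_QUORUM":
        # A quorum is needed in every data center.
        return sum(rf // 2 + 1 for rf in rf_by_dc.values())
    if level == "ALL":
        return total_rf
    raise ValueError("unknown consistency level: " + level)


def needs_remote_dc(level, rf_by_dc, local_dc):
    """True if local replicas alone cannot satisfy the consistency level."""
    if level == "LOCAL_QUORUM":
        return False
    if level == "EACH_QUORUM":
        return any(dc != local_dc and rf > 0 for dc, rf in rf_by_dc.items())
    return required_responses(level, rf_by_dc, local_dc) > rf_by_dc[local_dc]


# Replication factor 2 in each of two data centers, as in the answer.
rf = {"us-east": 2, "us-west": 2}
for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"):
    print(level, needs_remote_dc(level, rf, "us-east"))
```

With a replication factor of 2 per data center, ONE and LOCAL_QUORUM can be served locally, while QUORUM (3 of 4 replicas), EACH_QUORUM and ALL cannot.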