There is a new system keyspace called "system_schema", but there are only two tables in it:
Ok, something must have gone wrong in your upgrade, because that's not right. When I check system_schema on Cassandra 3.4, I see this:
[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
aploetz@cqlsh> use system_schema ;
aploetz@cqlsh:system_schema> desc tables;
tables triggers views keyspaces dropped_columns
functions aggregates indexes types columns
There are definitely more than two tables in that keyspace.
How do I query to see what keyspaces are available?
The new way to do this is to query system_schema.keyspaces:
aploetz@cqlsh:system_schema> SELECT * FROM keyspaces;
keyspace_name | durable_writes | replication
------------------------+----------------+-------------------------------------------------------------------------------------
zeroreplication | True | {'DC1': '0', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_auth | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
experfy_class | True | {'DC1': '1', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
stackoverflow | True | {'DC1': '1', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
eqcontrol | True | {'DC1': '1', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
The main difference between system.schema_keyspaces and system_schema.keyspaces is that system_schema.keyspaces has only three columns instead of four: strategy_class and strategy_options were combined into the single replication column.
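Since keyspace_name is the partition key of system_schema.keyspaces, you can also restrict the query to a single keyspace (using one of the keyspaces from the output above purely as an example):

SELECT replication
FROM system_schema.keyspaces
WHERE keyspace_name = 'stackoverflow';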
Schema design in Cassandra, if you want efficient tables, will grate against your RDBMS experience; for efficiency, Cassandra prefers denormalization over normalization. By this I mean that if you have some user information and you want to look that data up by two different keys, then in Cassandra it is actually better to use two tables (and duplicate the data). Yes, this means more storage space, but it also allows for faster reads.
As a side note, based on my own experiences, I would recommend against using a secondary index, and would instead simply use another table. Secondary indexes in Cassandra are treated a little differently, with background threads that update the indexes periodically; this makes reading from an index not quite as reliable (i.e. more likely to surprise you, in a not-good way) as just using a table.
Thus I would recommend something like the following two tables for your needs:
CREATE TABLE users (
    id TIMEUUID PRIMARY KEY,
    user UUID,
    friend UUID
);

CREATE TABLE friends (
    id TIMEUUID,
    user UUID,
    friend UUID,
    PRIMARY KEY (user, friend)
);
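With the first table, looking a row up by its generated id is then a simple single-partition query (the ? is just a bind-marker placeholder):

SELECT user, friend FROM users WHERE id = ?;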
This second table would let you do your CQL query:
SELECT * FROM friends WHERE user = ?
Notice that this friends table uses a compound primary key. This allows multiple friend values to be associated with a single user value.
One of the downsides of this multiple-table approach is that your application code is now responsible for writing into both tables for a single "update", and you have to deal with any potential skew/reconciliation. Cassandra achieves its performance, in many ways, by avoiding enforcement of foreign key constraints and the like, leaving that to the application.
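One way to soften that dual-write burden is a logged batch, which Cassandra guarantees will eventually apply to both tables, at the cost of some coordinator overhead. As a sketch (the ? marks are bind placeholders):

BEGIN BATCH
    INSERT INTO users (id, user, friend) VALUES (?, ?, ?);
    INSERT INTO friends (id, user, friend) VALUES (?, ?, ?);
APPLY BATCH;

Note that logged batches give you atomicity (all-or-nothing eventual application) but not isolation: a reader may briefly see one write land before the other.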
Hope this helps!
Best Answer
Let's first understand data distribution across multiple data centers. If you have two data centers, you basically have the complete data set in each data center. And if you have set a replication factor of, say, 2 for each data center, then each data center will have 2 copies of the data. This is where token assignment to nodes comes into the picture as an important factor in making sure no node is overburdened. See this image from the DataStax website:
You see that T0 and T3 are fatter than the rest. Looking at the whole cluster, it seems that data must be equidistributed... but that's not true. Each data center can be seen as a virtual ring within the ring. :) You see the images below? Now you realize that T0 is carrying the data range (T2, T0] and T3 is loaded with (T5, T3]! But because all of this stays within the data center, on any sunny day you would not need to fetch from the other data center unless you have ALL or EACH_QUORUM (or a CL higher than the number of replicas in the data center) as your consistency level.

So, to answer your question: in a multiple-data-center setup, you will always have data in each data center (assuming a proper replication factor is set up). If you have a consistency level that requires reading/writing from the other data center to be satisfied, the coordinating node (the one your client is connected to) will talk to a node in the other data center. In EC2, each region is a data center, and each availability zone is treated as a rack.