There is a new system keyspace called "system_schema", but there are only two tables in it:
Ok, something must have gone wrong in your upgrade, because that's not right. When I check system_schema on Cassandra 3.4, I see this:
[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
aploetz@cqlsh> use system_schema ;
aploetz@cqlsh:system_schema> desc tables;
tables triggers views keyspaces dropped_columns
functions aggregates indexes types columns
There are definitely more than two tables in that keyspace.
How do I query to see what keyspaces are available?
The new way to do this is to query system_schema.keyspaces:
aploetz@cqlsh:system_schema> SELECT * FROM keyspaces;
keyspace_name | durable_writes | replication
------------------------+----------------+-------------------------------------------------------------------------------------
zeroreplication | True | {'DC1': '0', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_auth | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
experfy_class | True | {'DC1': '1', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
system | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
stackoverflow | True | {'DC1': '1', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
eqcontrol | True | {'DC1': '1', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
The main difference between system.schema_keyspaces and system_schema.keyspaces is that system_schema.keyspaces has only three columns instead of four: strategy_class and strategy_options were combined into the single replication column.
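Since keyspace_name is the partition key of system_schema.keyspaces, you can also restrict the query to a single keyspace (using one of the keyspaces from the output above purely as an example):

SELECT replication
FROM system_schema.keyspaces
WHERE keyspace_name = 'stackoverflow';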
Schema design in Cassandra, if you want efficient tables, will grate against your RDBMS experience; for efficiency, Cassandra prefers denormalization over normalization. By this I mean that if you have some user information and you want to look that data up by two different keys, then in Cassandra it is actually better to use two tables (and duplicate the data). Yes, this means more storage space, but it also allows for faster reads.
As a side note, based on my own experiences, I would recommend against using a secondary index, and would instead simply use another table. Secondary indexes in Cassandra are treated a little differently, with background threads that update the indexes periodically; this makes reading from an index not quite as reliable (i.e. more likely to surprise you, in a not-good way) as just using a table.
Thus I would recommend something like the following two tables for your needs:
CREATE TABLE users (
    id TIMEUUID PRIMARY KEY,
    user UUID,
    friend UUID
);

CREATE TABLE friends (
    id TIMEUUID,
    user UUID,
    friend UUID,
    PRIMARY KEY (user, friend)
);
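With the first table, looking a row up by its generated id is then a simple single-partition query (the ? is just a bind-marker placeholder):

SELECT user, friend FROM users WHERE id = ?;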
This second table would let you do your CQL query:
SELECT * FROM friends WHERE user = ?
Notice that this friends table uses a compound primary key. This allows multiple friend values to be associated with a single user value.
One of the downsides of this multiple-table approach is that your application code is now responsible for writing into both tables for a single "update", and you have to deal with any potential skew/reconciliation. Cassandra achieves its performance, in many ways, by avoiding enforcement of foreign key constraints and the like, leaving that to the application.
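One way to soften that dual-write burden is a logged batch, which Cassandra guarantees will eventually apply to both tables, at the cost of some coordinator overhead. As a sketch (the ? marks are bind placeholders):

BEGIN BATCH
    INSERT INTO users (id, user, friend) VALUES (?, ?, ?);
    INSERT INTO friends (id, user, friend) VALUES (?, ?, ?);
APPLY BATCH;

Note that logged batches give you atomicity (all-or-nothing eventual application) but not isolation: a reader may briefly see one write land before the other.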
Hope this helps!
Best Answer
Let's first understand data distribution across multiple data centers. If you have two data centers, you basically have the complete data set in each data center. And if you have set a replication factor of, say, 2 for each data center, then each data center will have 2 copies of the data. This is where token assignment to nodes comes into the picture as an important factor in making sure no node is overburdened. See this image from the DataStax website:
You see that T0 and T3 are fatter than the rest. Looking at the whole cluster, it seems that data must be equidistributed... but that's not true. Each data center can be seen as a virtual ring within the ring. :) You see the images below? Now you realize that T0 is carrying the data range (T2, T0] and T3 is loaded with (T5, T3]! But because all of this stays within the data center, on any sunny day you would not need to fetch from the other data center unless you have ALL or EACH_QUORUM (or a CL higher than the number of replicas in the data center) as your consistency level.

So, to answer your question: in a multiple-data-center setup, you will always have data in each data center (assuming a proper replication factor is set up). If you have a consistency level that requires reading/writing from the other data center to be satisfied, the coordinating node (the one your client is connected to) will talk to a node in the other data center. In EC2, each region is a data center, and each availability zone is treated as a rack.