There are a couple of concepts which need to be distinguished. One is about structure and the other about schema.
Structured data is data where the application knows in advance the meaning of each byte it receives. A good example is measurements from a sensor. In contrast, a Twitter stream is unstructured. Schema is about how much of that structure is communicated to the DBMS and how much the DBMS is asked to enforce; it controls how much the DBMS parses the data it stores. A schema-required DBMS such as SQL Server can store unparsed data (varbinary), optionally-parsed data (xml), or fully parsed data (columns).
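As a rough illustration (hypothetical table and column names, just to show the three levels in SQL Server):

CREATE TABLE dbo.SensorReading (
    RawPayload varbinary(max), -- unparsed: the DBMS just stores the bytes
    Detail     xml,            -- optionally parsed: shredded only when queried
    SensorId   int,            -- fully parsed: typed, validated, constrainable
    ReadingAt  datetime2
);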
NoSQL DBMSs lie on a spectrum from no parsing (key-value stores) upwards. Cassandra offers relatively rich functionality in this respect. Where they differ markedly from relational stores is in the uniformity of the data. Once a relational table is defined, only data which matches that definition may be held there. In Cassandra, however, even if columns and families are defined, there is no requirement for any two rows in the same table to look anything like each other. It falls to the application designer to decide how much goes in a single row (also referred to as a document) and what is held separately, linked by pointers. In effect: how much denormalisation do you want?
The advantage is that you can retrieve a full set of data with a single sequential read, which is fast. One downside is that you, the application programmer, are now solely responsible for all data integrity and backward-compatibility concerns, forever, for every bit of code that ever touches this data store. That can be difficult to get right. Also, you are locked into one point of view on the data: if you key your rows by order number, how do you report on the sales of one particular product, or region, or customer?
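For instance (a hypothetical CQL sketch, not from the original question), an orders table keyed by order number makes the one-read case trivial but every other view of the data awkward:

CREATE TABLE orders_by_number (
    order_number bigint PRIMARY KEY,
    customer     text,
    region       text,
    products     list<text>    -- the whole order denormalised into one row
);

-- One sequential read per order:
SELECT * FROM orders_by_number WHERE order_number = 12345;

-- Reporting by product, region or customer now needs a full scan, a secondary
-- index, or a second denormalised table such as orders_by_product.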
PostgreSQL
@Dobob everything can be normalized :) If it's not very secret, what sort of data is it? (Asking as a geneticist-turned-DBA.)
This is generally true. And you should start by looking at how you can normalize. Clearly you're averse to the idea, and you can't always normalize. For instance, a raster is often a collection of readings from a scientific instrument, and you can't really normalize that. They're taken at the same time, represent the same sample, and frequently have 1,000s of data points. (Though you can now store them in PostGIS.)
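Where normalisation does apply, the usual starting point is something long and narrow rather than wide (hypothetical names, just to show the shape):

CREATE TABLE reading (
    sample_id integer NOT NULL,  -- one row per sample/marker pair
    marker_id integer NOT NULL,  -- instead of one column per marker
    value     integer NOT NULL,
    PRIMARY KEY (sample_id, marker_id)
);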
Solution for Problem 1
PostgreSQL does not support 16,383 columns on a table (nor anything over 1,600, for that matter), but it does support SQL arrays. You can probably make use of that:
CREATE TABLE foo AS
  SELECT r AS id,
         array_agg(c) AS genetic_stuff
  FROM generate_series(1,567) AS r
  CROSS JOIN LATERAL generate_series(1,16382) AS c
  GROUP BY r;
Now you have a table that meets the first criterion but isn't insane.
Table "public.foo"
Column | Type | Modifiers
---------------+-----------+-----------
id | integer |
genetic_stuff | integer[] |
Finding the average is simple:
SELECT id, avg(x)
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff) AS t(x)
GROUP BY id;
And that whole operation? Execution time: 3865.308 ms. Can't complain about that. Feel free to use a MATERIALIZED VIEW if you want to cache that average.
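A minimal sketch of that cache (hypothetical view name):

CREATE MATERIALIZED VIEW foo_avg AS
SELECT id, avg(x) AS avg_genetic_stuff
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff) AS t(x)
GROUP BY id;

REFRESH MATERIALIZED VIEW foo_avg;  -- rerun whenever foo changes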
For standard deviation, just use the appropriate aggregate function in place of avg(). Execution times are about the same.
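For example, stddev() (or stddev_samp()/stddev_pop(), depending on which you want) is the obvious drop-in:

SELECT id, stddev(x)
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff) AS t(x)
GROUP BY id;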
I'm not sure how you're querying this, but even without an index (the worst-case situation) it's pretty simple and fast. Considering it's a seq scan, scanning through an array with 16,000 items isn't slow for a modern CPU either, and on PostgreSQL 9.6 this seq scan should even run in parallel (afaik).
SELECT * FROM foo
WHERE genetic_stuff @> ARRAY[5];
Can you use an index? Sure. From the intarray module:
Two GiST index operator classes are provided: gist__int_ops (used by default) is suitable for small- to medium-size data sets, while gist__intbig_ops uses a larger signature and is more suitable for indexing large data sets (i.e., columns containing a large number of distinct array values). The implementation uses an RD-tree data structure with built-in lossy compression.
Let's give it a shot...
CREATE EXTENSION intarray;

CREATE INDEX ON foo
  USING gist(genetic_stuff gist__intbig_ops);
VACUUM FULL foo;
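The plan below isn't shown with its statement, but it matches what you'd expect from running EXPLAIN ANALYZE over the same containment query (a reconstruction, not the original output's header):

EXPLAIN ANALYZE
SELECT * FROM foo
WHERE genetic_stuff @> ARRAY[5];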
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Index Scan using foo_genetic_stuff_idx on foo (cost=0.14..8.16 rows=1 width=22) (actual time=0.119..47.846 rows=567 loops=1)
Index Cond: (genetic_stuff @> '{5}'::integer[])
Planning time: 0.072 ms
Execution time: 47.948 ms
(4 rows)
Not sure what more you want than an index scan on that. It would probably be more useful if not every row had an element of 5, but you know, it's only sample data.
Anyway have at it.
Solution for Problem 2
Shy of that, the other thing isn't a problem at all. There is a bit of overhead because Pg rows are fat, but you're not talking many rows, and ultimately it shouldn't matter: the table below, which meets that criterion, is only 5088 kB on my system. Hardly worth concerning yourself with.
CREATE TABLE baz AS
  SELECT x   AS id,
         x*2 AS x2,
         x*3 AS x3,
         x*4 AS x4
  FROM generate_series(1,117493) AS t(x);
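If you want to check that figure on your own system, pg_total_relation_size() will tell you:

SELECT pg_size_pretty(pg_total_relation_size('baz'));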
Best Answer
Ok, something must have gone wrong in your upgrade, because that's not right. When I check system_schema on my 3.4 install, I see this:
There are definitely more than two tables in that keyspace.
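If you want to verify that on your own cluster, something like this from cqlsh (assuming a 3.x node) lists them:

SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'system_schema';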
The new way to do this is to query system_schema.keyspaces:
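SELECT * FROM system_schema.keyspaces;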
The main difference between system.schema_keyspaces and system_schema.keyspaces is that system_schema.keyspaces only has 3 columns instead of four (strategy_class and strategy_options were combined into replication).
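So a 3.x keyspace row carries the old strategy information inside a single map column, roughly like this (illustrative values, not your cluster's output):

SELECT keyspace_name, durable_writes, replication
FROM system_schema.keyspaces;

-- replication is a frozen<map<text, text>> holding what strategy_class and
-- strategy_options used to, e.g.
-- {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}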