Cassandra – Querying Without a Partition Key


I read Cassandra's documentation on the internal steps it performs when querying data. It looks like Cassandra relies on the Partitioner and Replication Strategy to process queries. What confuses me is that the Partitioner needs to know the partition key. If the query includes the partition key, the internal query process looks straightforward. But what if the query expects a result set rather than a single deterministic row, as in the query below?

SELECT * FROM <table>

In this case, when there is no Primary Key specified in the WHERE clause, how does the Coordinator know which nodes to send the requests to?
If multiple rows are returned, which may be distributed in different nodes, how are these rows aggregated and returned to client?

Best Answer

when there is no Primary Key specified in the WHERE clause, how does the Coordinator know which nodes to send the requests to?

It doesn't. The (node chosen as the) coordinator has to scan all rows for that table on each and every node. That's why unbound queries are considered an anti-pattern in Cassandra: they incur a lot of network time, especially in larger clusters. The coordinator also has to do extra work to assemble and return the result set.
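To make the difference concrete, here is a minimal Python sketch of the routing decision. Everything in it is an assumption for illustration: the four-node ring, the node names, a replication factor of 1, and the toy_token() helper, which uses MD5 only to get a stable 64-bit value (Cassandra's default partitioner uses Murmur3, not this). It is not how Cassandra is implemented, only the shape of the idea.

```python
import bisect
import hashlib

# Hypothetical 4-node ring: each node owns the token range that ends at its token.
RING = [
    (-4611686018427387904, "node1"),
    (0, "node2"),
    (4611686018427387904, "node3"),
    (9223372036854775807, "node4"),
]
TOKENS = [token for token, _ in RING]


def toy_token(partition_key):
    """Stand-in for the partitioner: map a key to a signed 64-bit token.

    Assumption: MD5 is used only as a stable stdlib hash; it is NOT
    Cassandra's Murmur3 partitioner and yields different tokens.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def nodes_for_key(partition_key):
    """With a partition key, the token pins the request to one range owner."""
    idx = bisect.bisect_left(TOKENS, toy_token(partition_key))
    return [RING[idx % len(RING)][1]]


def nodes_for_full_scan():
    """Without a partition key, every token range has to be visited."""
    return [name for _, name in RING]


if __name__ == "__main__":
    print(nodes_for_key("Mal"))    # a single node
    print(nodes_for_full_scan())   # every node in the ring
```

With a key, the request goes to one owner (plus replicas in a real cluster); without one, the work grows with the number of nodes.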

If multiple rows are returned, which may be distributed in different nodes, how are these rows aggregated and returned to client?

They are not so much aggregated as returned in order of the hashed token value of their partition key.

Consider an unbound query run against a table named crew, with a partition key of crewname. If I apply the CQL token() function to that key, you can see that the rows are indeed returned in token order.

aploetz@cqlsh:presentation> SELECT crewname,token(crewname),firstname,lastname 
FROM crew;

 crewname | token(crewname)      | firstname | lastname
----------+----------------------+-----------+-----------
    Simon | -8694467316808994943 |     Simon |       Tam
    Jayne | -3415298744707363779 |     Jayne |      Cobb
     Wash |   596395343680995623 |     Hoban | Washburne
      Mal |  4016264465811926804 |   Malcolm |  Reynolds
     Zoey |  7853923060445977899 |      Zoey | Washburne
 Sheppard |  8386579365973272775 |    Derial |      Book

(6 rows)

It works this way because Cassandra makes certain nodes primarily responsible for certain token ranges. It then becomes a simple task for the coordinator to return the result set in that order. If there are multiple rows with the same partition key, the results are additionally sorted by the clustering keys within each partition.
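As a rough illustration of why the ordering falls out naturally, the Python sketch below replays the crew output above. The token values are copied from that output. The three-node split of the token space, the node names, and the helper functions are hypothetical and exist only for this example.

```python
# (crewname, token) pairs taken from the cqlsh output above, in arbitrary order.
ROWS = [
    ("Mal", 4016264465811926804),
    ("Zoey", 7853923060445977899),
    ("Simon", -8694467316808994943),
    ("Wash", 596395343680995623),
    ("Sheppard", 8386579365973272775),
    ("Jayne", -3415298744707363779),
]

# Hypothetical ring: (upper bound of the token range, node name), in ring order.
RANGES = [
    (-3000000000000000000, "node1"),
    (5000000000000000000, "node2"),
    (9223372036854775807, "node3"),
]


def owner(token):
    """Return the node whose token range contains the given token."""
    for upper_bound, node in RANGES:
        if token <= upper_bound:
            return node
    raise ValueError("token outside the ring")


def scan_node(node):
    """Rows one node would hand back for its own range, sorted by token."""
    local = [row for row in ROWS if owner(row[1]) == node]
    return sorted(local, key=lambda row: row[1])


def full_scan():
    """Coordinator view: walk the ranges in ring order and concatenate."""
    result = []
    for _, node in RANGES:
        result.extend(scan_node(node))
    return result


if __name__ == "__main__":
    for name, token in full_scan():
        print(f"{name:>9} | {token}")
```

Because each range is already in token order and the ranges are visited in ring order, concatenating them reproduces the order of the cqlsh output without any global sort.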

Related Solutions

How to Get nodetool Without Cassandra

The easiest (non-invasive) way is probably to download the tarball installation (you'll need to select either a Mac or Linux-based OS for it to allow you to download the tarball). Based on your mention of disabling the service, I'm going to guess that you want to accomplish this on Windows. If that's not the case, please indicate so in the comments.

Untar dsc-cassandra-2.0.8-bin.tar.gz in the location you want to run nodetool from. For example:

$ cd /tools
$ tar -zxvf dsc-cassandra-2.0.8-bin.tar.gz

Note: You may have a different application you use for tarballs. I ran this from a Cygwin terminal.

Find the location of your JRE/JDK (not the bin directory) and set that as your "JAVA_HOME" (System) environment variable. When you have it set properly, you should be able to query it via CMD:

>echo %JAVA_HOME%
C:\Program Files (x86)\Java\jre7

Once you have JAVA_HOME set, nodetool should work from either CMD or PowerShell:

C:\tools\dsc-cassandra-2.0.8\bin>nodetool -h 192.168.1.85 status
Starting NodeTool
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: datacenter1
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  192.168.1.85  506.29 MB  256     100.0%  cd39f0fe-ed67-40cf-b6bd-504cedabf497  rack1

This way, you can run nodetool without messing with an installer or services.

Cassandra – Cluster Monitoring

Check Java heap memory on each node, both total heap and heap in use.
CPU utilization on each node is one of the most important metrics.
Set alerts for errors; system.log contains a lot of information about them.
Set alerts on the data disk and the log disk.
Add a heartbeat check for each server, so that you are alerted if no heartbeat arrives within a few minutes.
Also set alerts for dropped mutations and hinted handoff.

Basically, start by watching system.log; over time it will show you more and more errors worth monitoring.
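As a starting point for the system.log alerts, here is a small Python sketch that counts ERROR and WARN lines. It assumes each log line begins with the level name (optionally preceded by spaces). The sample lines, the thresholds, and the function names are made up for illustration, so check them against your own log format before relying on them.

```python
import re
from collections import Counter

# Assumed line shape: the level comes first, e.g. "ERROR [SomeThread] <timestamp> ...".
LEVEL_RE = re.compile(r"^\s*(ERROR|WARN)\b")


def count_alerts(lines):
    """Count how many lines start with ERROR or WARN."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def should_alert(counts, max_errors=0, max_warnings=10):
    """Hypothetical policy: any ERROR, or more than max_warnings WARN lines."""
    return counts["ERROR"] > max_errors or counts["WARN"] > max_warnings


if __name__ == "__main__":
    # Made-up sample lines; real system.log content will differ.
    sample = [
        " INFO [main] node started",
        " WARN [ScheduledTasks:1] heap usage is high",
        "ERROR [ReadStage:3] exception in thread",
    ]
    counts = count_alerts(sample)
    print(dict(counts), should_alert(counts))
```

In practice you would pass an open file handle for system.log instead of the sample list; its location depends on how Cassandra was installed.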
