I read Cassandra's documentation on the internal steps it performs when querying data. It looks like Cassandra relies on the Partitioner and the Replication Strategy to process queries. What still confuses me is that the Partitioner needs to know the Partition Key. If the query includes the Partition Key, the internal query process looks straightforward. But what happens when the query expects a result set rather than a single deterministic row, as in the query below?
SELECT * FROM <table>
- In this case, when there is no Primary Key specified in the WHERE clause, how does the Coordinator know which nodes to send the requests to?
- If multiple rows are returned, which may be distributed across different nodes, how are these rows aggregated and returned to the client?
Best Answer
Consider an unbound query (one with no WHERE clause) run against a table named crew, with a partition key of crewname. When I run the CQL token() function on that key, the rows returned are indeed ordered by their token.

It works this way because Cassandra makes certain nodes primarily responsible for certain token ranges. It then becomes a simple task for the coordinator to return the result set in that order. If there are multiple rows with the same partition key, the results will additionally be sorted by the clustering keys within each partition key.
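
To make the token-range idea concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not Cassandra's actual code: `toy_token`, `ToyRing`, the node names, and the node tokens are all hypothetical, and MD5 stands in for the Murmur3 hash used by Cassandra's default partitioner. Each node owns the range of tokens up to and including its own token, and the coordinator answers an unbound query by walking those ranges in token order.

```python
import hashlib
from bisect import bisect_left

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def toy_token(partition_key: str) -> int:
    """Hypothetical stand-in for a partitioner: key -> signed 64-bit token.

    MD5 is used only because it ships with Python; it is NOT what
    Cassandra's default partitioner uses.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class ToyRing:
    """Toy token ring: each node owns the range (previous token, own token]."""

    def __init__(self, node_tokens):
        # Sort nodes by token so the ring can be walked in token order.
        self.ring = sorted((tok, name) for name, tok in node_tokens.items())
        self.data = {name: [] for name in node_tokens}

    def owner(self, token):
        tokens = [t for t, _ in self.ring]
        # First node whose token is >= the row token; wrap to the first node
        # when the row token is above the highest node token.
        idx = bisect_left(tokens, token) % len(self.ring)
        return self.ring[idx][1]

    def insert(self, partition_key, clustering_key, value):
        tok = toy_token(partition_key)
        row = (tok, partition_key, clustering_key, value)
        self.data[self.owner(tok)].append(row)

    def _rows_in_range(self, name, lower, upper):
        # Tuple sorting orders by token, then partition key, then clustering key.
        return sorted(r for r in self.data[name] if lower < r[0] <= upper)

    def scan_all(self):
        """Unbound query: visit every token range in ascending token order."""
        results = []
        lower = MIN_TOKEN - 1
        for upper, name in self.ring:
            results.extend(self._rows_in_range(name, lower, upper))
            lower = upper
        # Wrap-around range: tokens above the highest node token live on
        # the first node in the ring.
        first_node = self.ring[0][1]
        results.extend(self._rows_in_range(first_node, lower, MAX_TOKEN))
        return results


if __name__ == "__main__":
    # Made-up node tokens and sample rows, for illustration only.
    ring = ToyRing({"node1": -3 * 2**61, "node2": 0, "node3": 3 * 2**61})
    for crewname in ["alice", "bob", "carol", "dave"]:
        ring.insert(crewname, 1, "sample row")
    for tok, crewname, clustering_key, value in ring.scan_all():
        print(tok, crewname, clustering_key, value)
```

In this model the coordinator never needs a partition key for an unbound query: it simply asks the owner of each token range in turn and concatenates the answers, which is why the combined result comes back in token order rather than in key order.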