Why does Cassandra recommend against creating an index on high-cardinality columns

cassandraindex

The Cassandra documentation states,

Do not use an index in these situations:

  • On high-cardinality columns because you then query a huge volume of records for a small number of results. See Problems using a high-cardinality column index below.

It goes on,

If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.

But never really answers the question: why is it inefficient? I have no idea what "manually maintaining the table as a form of an index" means. But then it somewhat contradicts itself with "…it is sometimes fine performance-wise to use an index for convenience as long as the query volume is moderate…"

Is this just trying to tell me to use the PK when and where I can? What's the inefficiency? My understanding is that a query that would hit an index would need to query every¹ node in the cluster, and then each node would do a lookup in its local index and the results would then get aggregated. This is not necessarily expensive (each index lookup should be fairly cheap) except that we pay in network latency, as we must wait for the slowest node of the lot. Am I missing anything here?

But if I have a collection that has a bajillion items that — on rare occasion — needs to be looked up by a different but almost unique attribute … this is an appropriate use, right?

¹Every? IDK if replication means that this can hit 1/3 of the cluster for a replication factor of 3 or not?

Best Answer

With a Cassandra index (i.e. a "secondary index", as opposed to primary keys), each node has to query its own local data for responding to a query (see the Cassandra secondary indexexes FAQ). These index are also built using a background process. This backgrounding means that the index may return false negatives in terms of hits (or false positives in terms of misses).

This means that in a high-cardinality column, the rate of change (i.e. additions/deletions) from that column can be quite high. And thus if that rate of change is faster than the updating of the index via the background process, then using an index is "inefficient" (the index is performing more work than is needed by the application, which might often get the wrong answer).

A more efficient approach, in terms of query accuracy, might be to maintain a second table, rather than a secondary index. Tables, as opposed to indexes, are treated just like any other table. They are more likely to give your application the query results it expects. The downside are that maintaining a table as an index, versus a Cassandra "secondary index", are now application constraints (i.e. your application code now has to know to insert/delete rows from that "index" table, and to keep the two tables in sync via application-level "reconciliation").

Hope this helps!