Keeping rows ordered by counter value

cassandra nosql

I'm storing linked words in a Cassandra v2 cluster and I'm having a hard time doing efficient reads. I'm trying to fetch the top X rows with the highest counter value.

Cassandra doesn't store the rows in any ordered way, so simply adding LIMIT X to the CQL query returns random-ish rows.
The ORDER BY command in CQL only works for columns that are (part of) the key.
OrderPreservingPartitioner doesn't exist anymore in Cassandra v2.

At the moment, when doing a read, I have to fetch all rows and manually select those X rows with the highest counter value. This literally slows some queries down by 100x, because even mildly popular words often have hundreds of thousands of words linked to them, not to mention the waste of bandwidth and other resources.

Are there any clever techniques to make this more efficient?

Best Answer

I'll try to address your concerns one at a time.

"Cassandra doesn't store the rows in any ordered way, so simply adding LIMIT X to the CQL query returns random-ish rows."

"The ORDER BY command in CQL only works for columns that are (part of) the key."

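Not entirely correct, on either point. Within a partition, Cassandra stores rows on disk sorted by the clustering columns of your primary key, so a LIMIT on a single partition returns the first rows in that order rather than random ones. As a result, ORDER BY works only on clustering columns, not on the partition key or on regular columns.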
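To make the ordering point concrete, here is a toy Python model. It is not Cassandra's API: the `ToyTable` class, its methods, and the sample words are all made up for illustration. Each partition keeps its rows sorted by clustering key, so a LIMIT-style read returns the first rows in that order, which has nothing to do with the counter value.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Hypothetical model: rows inside a partition stay sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering_key, counter), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, counter):
        # Tuples compare element by element, so with unique clustering keys
        # the clustering key alone decides where the row lands.
        bisect.insort(self._partitions[partition_key], (clustering_key, counter))

    def select(self, partition_key, limit):
        # LIMIT-style read: the first `limit` rows in clustering order.
        return self._partitions[partition_key][:limit]


table = ToyTable()
table.insert("dog", "walk", 40)
table.insert("dog", "bark", 7)
table.insert("dog", "cat", 95)
table.insert("dog", "bone", 12)

first_two = table.select("dog", limit=2)
print(first_two)
```

The two rows returned are the alphabetically first linked words, not the two with the highest counters.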

"I'm trying to fetch the top X rows with the highest counter value."

Cassandra will not allow you to put a secondary index on a counter column, nor can you make it a part of your key (which also means you cannot sort by it). Therefore, querying data by the value of a counter column is not possible.

It would appear that your use case is not an appropriate fit for Cassandra. I would suggest solving this issue with a relational database or something else that would provide you with the necessary aggregation tools.
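As a rough sketch of the relational alternative, the following uses Python's built-in `sqlite3` module with an in-memory database. The table name `linked_words`, its columns, and the sample data are assumptions made for illustration; the original question does not describe a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE linked_words ("
    " word TEXT NOT NULL,"
    " linked_word TEXT NOT NULL,"
    " hits INTEGER NOT NULL DEFAULT 0,"
    " PRIMARY KEY (word, linked_word))"
)
# An index on (word, hits DESC) lets the database read the top rows
# directly instead of sorting every row linked to the word.
conn.execute("CREATE INDEX idx_word_hits ON linked_words (word, hits DESC)")

rows = [
    ("dog", "walk", 40),
    ("dog", "bark", 7),
    ("dog", "cat", 95),
    ("dog", "bone", 12),
]
conn.executemany("INSERT INTO linked_words VALUES (?, ?, ?)", rows)


def top_linked(conn, word, limit):
    """Return the `limit` linked words with the highest hit count."""
    cur = conn.execute(
        "SELECT linked_word, hits FROM linked_words"
        " WHERE word = ? ORDER BY hits DESC LIMIT ?",
        (word, limit),
    )
    return cur.fetchall()


top_two = top_linked(conn, "dog", 2)
print(top_two)
```

Here the database, rather than the client, does the sorting and truncation, so only X rows cross the network.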

Related Solutions

A Key/Value store database

Are you familiar with the concept of a Key/Value Pair? If you know Java or C#, the same idea exists in the language as a map, hash, datatable, or KeyValuePair (the last one is specific to C#).

The way it works is demonstrated in this little sample chart:

Color        Red
Age          18
Size         Large
Name         Smith
Title        The Brown Dog

Here you have a key (left) and a value (right). Notice the value can be a string, an int, or the like. Most KVP objects allow you to store any object on the right, because it's just a value.

Since you'll always have a unique key for the particular object you want to return, you can just query the database for that key and get the result back from whichever node holds the object. This is why the model suits distributed systems, although other things are involved too, such as polling until the first n nodes return values that agree with each other.

Now my example above is very simple, so here's a slightly better version of the KVP:

user1923_color    Red
user1923_age      18
user3371_color    Blue
user4344_color    Brackish
user1923_height   6' 0"
user3371_age      34

As you can see, the simple key generation scheme is to concatenate "user", the user's unique number, an underscore, and the name of the property. Again, this is a simple variation, but the point is that as long as we can define the part on the left and format it consistently, we can pull out the value.
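The naming scheme above can be sketched in Python, with a plain `dict` standing in for the key/value store. The helper name `make_key` is hypothetical; a real store would expose its own client API for reads and writes.

```python
def make_key(user_id, field):
    """Build a key of the form 'user<id>_<field>', e.g. 'user1923_color'."""
    return f"user{user_id}_{field}"


# A plain dict stands in for the key/value store.
store = {}
store[make_key(1923, "color")] = "Red"
store[make_key(1923, "age")] = 18
store[make_key(3371, "color")] = "Blue"

# Reads rebuild the same key; a key that was never written yields None.
color = store[make_key(1923, "color")]
missing = store.get(make_key(4344, "age"))
print(color, missing)
```

Because every reader and writer builds keys through the same function, the format stays consistent, which is the only thing the lookup depends on.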

Notice that there's no restriction on the key (ok, there can be some limitations, such as text-only) or on the value (there may be a size restriction), but so far the examples haven't been very complex. Let's try to go a little further:

app_setting_width      450
user1923_color         Red
user1923_age           18
user3371_color         Blue
user4344_color         Brackish
user1923_height        6' 0"
user3371_age           34
error_msg_457          There is no file %1 here
error_message_1        There is no user with %1 name
1923_name              Jim
user1923_name          Jim Smith
user1923_lname         Smith
Application_Installed  true
log_errors             1
install_path           C:\Windows\System32\Restricted
ServerName             localhost
test                   test
test1                  test
test123                Brackish
devonly
wonderwoman
value                  key

You get the idea... all those would be stored in one massive "table" on the distributed nodes (there's math behind it all) and you would just ask the distributed system for the value you need by name.
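One simple form the "math" can take is hashing the key to pick a node. The sketch below is an assumption for illustration only: the node names, `node_for`, and `ToyCluster` are invented, and production systems typically use consistent hashing or token ranges instead of a plain modulo, since a modulo reshuffles most keys whenever the node count changes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(key, nodes=NODES):
    """Pick a node by hashing the key; the same key always maps to the same node."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


class ToyCluster:
    """Each node is modelled as its own dict; the hash decides which one is used."""

    def __init__(self, nodes=NODES):
        self._nodes = list(nodes)
        self._data = {name: {} for name in self._nodes}

    def put(self, key, value):
        self._data[node_for(key, self._nodes)][key] = value

    def get(self, key):
        return self._data[node_for(key, self._nodes)].get(key)


cluster = ToyCluster()
cluster.put("user1923_color", "Red")
cluster.put("app_setting_width", 450)
print(cluster.get("user1923_color"), node_for("user1923_color"))
```

The client never needs to know where a value lives ahead of time; it recomputes the node from the key on every read.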

At the very least, that's my understanding of how it all works. I may have a few things wrong, but that's the basics.

Obligatory Wikipedia link: http://en.wikipedia.org/wiki/Associative_array

Mongodb – Sharded key-value store using MongoDB

There are a number of reasons not to use MongoDB as a pure key-value store, and there are some reasons to consider it. MongoDB is optimized as a document store: it can index any field in a document, and it has rich primitives for JSON objects and hierarchies. You can use it as a key-value store, but the coarse-grained write locking in older versions means you won't get good performance out of your hardware. Storing simple blobs also gives up a number of MongoDB's benefits. MongoDB splits data chunks as you insert, which can create lag, and its system for re-partitioning is cumbersome as well. The benefit of a key-value system is that it should be really simple and really fast, so you can scale up and keep server and management costs down.

Other systems are better tuned for key-value use. You mention Redis, one of the best key-value stores, but repartitioning/clustering in Redis is still alpha-level, and the dataset has to fit in DRAM. Some people build their own sharding and partitioning layers on top of Redis; this is very common among some of the larger Chinese social networks.

Cassandra is sometimes used as a key-value store. This isn't the best use of Cassandra, as Cassandra's "super column families" provide rich indexing. Cassandra isn't as fast as databases written in C or C++, like Redis and MongoDB, but it does have strong clustering capabilities.

One store you should strongly consider in this area is Aerospike. Aerospike has very flexible cluster management (you add a node simply by bringing it up), supports both DRAM and SSD/Flash, and offers easy replication for HA. It's in use at very high levels of scale by advertising platform companies who need huge key-value stores. Aerospike has a free version that supports node sizes up to 200G.

Couchbase (formerly Membase) is another system to look at for key-value use. It provides some clustering primitives, and is focused more on in-memory use.
