What are the practical limitations on a column family in Cassandra

cassandra, scalability

In Cassandra, it's not recommended to have more than a few thousand column families (CFs); let's say 2,000 for the sake of argument. When more than 2,000 types of data need to be persisted, one approach would be to pack multiple unrelated types of data into each column family.

For example, a single CF could contain Orders, Invoices, and Customers, provided their row keys were distinct (e.g. prefixed with the object type, so that a single CF could hold both Order|1234 and Customer|1234). A second CF could contain, say, Addresses, LineItems, and OrderTypes. Given the basic feasibility of this approach, what are the practical limits on it? For example, what would be wrong with putting all 10,000 types of object into a single CF? As far as I can tell from the Cassandra wiki, there is no hard limit on the size of a CF.
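To make the key scheme concrete, here is a minimal sketch of building and parsing type-prefixed row keys. The `|` separator follows the Order|1234 example; the helper names `make_key` and `split_key` are hypothetical and not part of any Cassandra client API.

```python
from typing import Tuple

SEPARATOR = "|"  # assumed delimiter, matching the Order|1234 example


def make_key(object_type: str, object_id: str) -> str:
    """Build a row key such as 'Order|1234' from an object type and an id."""
    if SEPARATOR in object_type:
        # A separator inside the type would make the key ambiguous to parse.
        raise ValueError("object type must not contain the separator")
    return object_type + SEPARATOR + object_id


def split_key(row_key: str) -> Tuple[str, str]:
    """Recover (object_type, object_id) from a prefixed row key."""
    object_type, _, object_id = row_key.partition(SEPARATOR)
    return object_type, object_id


# Same numeric id, different types: the prefix keeps the keys distinct.
order_key = make_key("Order", "1234")
customer_key = make_key("Customer", "1234")
```

The prefix only guarantees that keys do not collide; it does nothing to separate the data physically, which is what the answer below objects to.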

Best Answer

I'm not a fan. It's about as good an idea as creating a relational table named OrdersOrCustomers with columns defined for both. The storage-engine penalty is slightly lower in Cassandra because of the sparse-cell storage under the hood, but it's still bad practice.

This bites you later when you want to map/reduce over your data; each task will have to scan over all your data, and filter out the rows that don't match what you're actually interested in (e.g., customers). And good luck making sense of the statistics that Cassandra tracks per-CF. ("Is this CF the source of 80% of my application reads because of the order data? Or because of the customer sessions it's combined with? Or the other five data types I threw in?")
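The scan cost can be illustrated with a small in-memory stand-in for a mixed CF. This is only a sketch: the dictionary, the `scan_for_type` helper, and the sample rows are hypothetical, and a real map/reduce job would read from Cassandra rather than a Python dict.

```python
# Hypothetical stand-in for one CF that holds several object types.
mixed_cf = {
    "Order|1": {"total": "10.00"},
    "Order|2": {"total": "25.50"},
    "Customer|1": {"name": "Ann"},
    "Invoice|1": {"order": "1"},
}


def scan_for_type(column_family, object_type):
    """Return the matching rows and the number of rows the scan had to read."""
    prefix = object_type + "|"
    rows_read = 0
    matches = {}
    for row_key, columns in column_family.items():
        rows_read += 1  # every row is visited, whatever its type
        if row_key.startswith(prefix):
            matches[row_key] = columns
    return matches, rows_read


customers, rows_read = scan_for_type(mixed_cf, "Customer")
```

Here one customer row is returned, but all four rows are read to find it. With one CF per type, the same job would read only the customer rows.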

And if you absolutely, positively need tens or hundreds of thousands of CFs? Even then, I'd rather run Cassandra without arena allocation than mutilate my data model like this.

Related Solutions

Cassandra Database Design – Penalties of Using Many Column Families or Keyspaces

Cassandra 1.0 uses a minimum of 1MB of heap per CF. So, 1000 or 2000 CFs will be okay for typical heap sizes, but 10000 will probably not be. JVM GC does poorly with very large heaps; I recommend staying under 8GB.
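The arithmetic behind those numbers can be sketched as follows. It uses only the two figures quoted above (1MB minimum per CF, 8GB heap ceiling); the function names are illustrative, and the result is a lower bound on per-CF overhead, not a measurement of real heap usage.

```python
MB_PER_CF = 1               # minimum heap per CF quoted above for Cassandra 1.0
HEAP_CEILING_MB = 8 * 1024  # the 8GB heap ceiling recommended above


def min_cf_heap_mb(cf_count):
    """Lower bound on heap consumed just by having cf_count column families."""
    return cf_count * MB_PER_CF


def heap_fraction(cf_count):
    """Share of an 8GB heap taken by that lower bound."""
    return min_cf_heap_mb(cf_count) / HEAP_CEILING_MB


for count in (1000, 2000, 10000):
    print(count, "CFs:", min_cf_heap_mb(count), "MB minimum,",
          round(heap_fraction(count) * 100), "% of an 8GB heap")
```

At 2,000 CFs the minimum overhead is just under a quarter of an 8GB heap; at 10,000 CFs it exceeds the whole heap before any data is cached.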

Cassandra – Query a column with collection type

You can index collection types in Cassandra 2.1 and later. The query you are after is:
SELECT * FROM <table> WHERE <field> CONTAINS <value_in_list/map/set>

Detailed example:

cqlsh> USE ks;
cqlsh:ks> CREATE TABLE data_points (
            id text PRIMARY KEY,
            created_at timestamp,
            previous_event_id varchar,
            properties map<text,text>
         );
cqlsh:ks> create index on data_points (properties);
cqlsh:ks> INSERT INTO data_points (id, properties) VALUES ('1', { 'fruit' : 'apple', 'band' : 'Beatles' });
cqlsh:ks> INSERT INTO data_points (id, properties) VALUES ('2', { 'fruit' : 'cherry', 'band' : 'Beatles' });
cqlsh:ks> SELECT * FROM data_points WHERE properties CONTAINS 'Beatles';

 id | created_at | previous_event_id | properties
----+------------+-------------------+----------------------------------------
  2 |       null |              null | {'band': 'Beatles', 'fruit': 'cherry'}
  1 |       null |              null |  {'band': 'Beatles', 'fruit': 'apple'}

(2 rows)

A word of warning: secondary indexes don't scale out well, because they use a scatter/gather algorithm to find what you need. If you plan to use them for heavy tagging, it might be better to denormalize the properties field into a separate table and carry out multiple queries.
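The denormalized alternative can be modeled in memory as two tables and two lookups. This is a sketch under assumptions: `ids_by_value` plays the role of the separate lookup table, the function names are hypothetical, and no Cassandra driver is involved. It also ignores updates and deletes, which would need to remove stale entries from the lookup table.

```python
from collections import defaultdict

data_points = {}                 # id -> properties (the main table)
ids_by_value = defaultdict(set)  # property value -> ids (hypothetical lookup table)


def insert_data_point(point_id, properties):
    """Write the row, then write one lookup entry per property value."""
    data_points[point_id] = dict(properties)
    for value in properties.values():
        ids_by_value[value].add(point_id)


def find_by_value(value):
    """First query: ids from the lookup table. Second query: rows by id."""
    matching_ids = sorted(ids_by_value.get(value, ()))
    return {point_id: data_points[point_id] for point_id in matching_ids}


insert_data_point("1", {"fruit": "apple", "band": "Beatles"})
insert_data_point("2", {"fruit": "cherry", "band": "Beatles"})
beatles = find_by_value("Beatles")
```

Each lookup addresses a known key, so in a real cluster it would go to the replicas that own that key rather than fanning out to every node.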
