What are the practical limitations on a column family in Cassandra?

cassandra scalability

In Cassandra, it's not recommended to have more than a few thousand column families; let's say 2,000 for the sake of argument. In cases where more than 2,000 types of data need to be persisted, one approach would be to pack multiple unrelated types of data into each column family.

For example, a single CF could contain Orders, Invoices, and Customers, provided their row keys were distinct (e.g. prefixed with the object type, so that the keys of a single CF could include both Order|1234 and Customer|1234). A second CF could contain, say, Addresses, LineItems, and OrderTypes. Given the basic feasibility of this approach, what are the practical limits on it? For example, what would be wrong with putting all 10,000 types of object into a single CF? As far as I can tell from the Cassandra wiki, there is no hard limit on the size of a CF.
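To make the key scheme concrete, here is a minimal sketch of the type-prefixed row keys described above. The helper names (`make_row_key`, `split_row_key`), the `|` separator constant, and the dict standing in for a column family are all hypothetical and only illustrate the idea; they are not part of any Cassandra client API.

```python
from typing import Dict, Tuple

SEPARATOR = "|"  # assumed separator, matching the Order|1234 example


def make_row_key(object_type: str, object_id: str) -> str:
    """Build a row key such as 'Order|1234' from a type name and an id."""
    if SEPARATOR in object_type:
        # A separator inside the type name would make keys ambiguous.
        raise ValueError("object type must not contain the separator")
    return f"{object_type}{SEPARATOR}{object_id}"


def split_row_key(row_key: str) -> Tuple[str, str]:
    """Recover (object_type, object_id) from a prefixed row key."""
    object_type, _, object_id = row_key.partition(SEPARATOR)
    return object_type, object_id


# A plain dict stands in for one shared column family: row key -> columns.
shared_cf: Dict[str, Dict[str, str]] = {
    make_row_key("Order", "1234"): {"total": "99.50"},
    make_row_key("Customer", "1234"): {"name": "Ada"},
}
```

The two rows share the id 1234 but do not collide, because the type prefix is part of the key.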

Best Answer

I'm not a fan. It's about as good an idea as creating a relational table named OrdersOrCustomers with columns defined for both. The storage-engine penalty is slightly lower in Cassandra because of the sparse-cell storage under the hood, but it's still bad practice.

This bites you later when you want to map/reduce over your data; each task will have to scan over all your data, and filter out the rows that don't match what you're actually interested in (e.g., customers). And good luck making sense of the statistics that Cassandra tracks per-CF. ("Is this CF the source of 80% of my application reads because of the order data? Or because of the customer sessions it's combined with? Or the other five data types I threw in?")
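The scan-and-filter cost can be sketched roughly as follows. This is an in-memory illustration only: `mixed_cf` and `scan_for_type` are hypothetical names, the dict stands in for a column family, and the sketch assumes rows cannot be range-scanned by key prefix (as with a hash-based partitioner), so every row has to be examined.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def scan_for_type(cf: Dict[str, Row], object_type: str) -> Tuple[List[str], int]:
    """Return the keys of the wanted type and the number of rows examined."""
    prefix = object_type + "|"
    examined = 0
    matching_keys: List[str] = []
    for key in cf:
        examined += 1  # every row is touched, whatever its type
        if key.startswith(prefix):
            matching_keys.append(key)
    return matching_keys, examined


# One CF holding three unrelated object types.
mixed_cf: Dict[str, Row] = {
    "Order|1": {"total": "10.00"},
    "Order|2": {"total": "25.00"},
    "Customer|1": {"name": "Ada"},
    "Invoice|1": {"amount": "10.00"},
    "Invoice|2": {"amount": "25.00"},
}

customer_keys, rows_examined = scan_for_type(mixed_cf, "Customer")
print(customer_keys, rows_examined)  # one matching row, five rows examined
```

With a dedicated Customers CF the same job would examine only the customer rows, and the per-CF statistics would describe customers alone.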

And if you absolutely, positively need tens or hundreds of thousands of CFs? Even then I'd rather run Cassandra without arena allocation, giving up that feature to avoid the per-CF memtable memory overhead it brings, than mutilate my data model like this.