Cassandra – How to Alter Table and Change Compaction Strategy to DTCS

cassandra

I have cassandra table defined as:

CREATE TABLE db.table (
    value text,
    time timestamp,
    sid text,
    PRIMARY KEY (sid, time)
) WITH CLUSTERING ORDER BY (time ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Table is pretty big, here are data files:

root@server:/home# ls -l /var/lib/cassandra/data/db/table/*-Data.db -h
-rw-r--r-- 1 cassandra cassandra 8.7G Aug 16 00:14 /var/lib/cassandra/data/db/table/db-table-ka-200785-Data.db
-rw-r--r-- 1 cassandra cassandra 2.2G Nov  5 21:28 /var/lib/cassandra/data/db/table/db-table-ka-208079-Data.db
-rw-r--r-- 1 cassandra cassandra  20G Nov 20 20:55 /var/lib/cassandra/data/db/table/db-table-ka-208702-Data.db
-rw-r--r-- 1 cassandra cassandra  18G Dec 17 17:31 /var/lib/cassandra/data/db/table/db-table-ka-210089-Data.db
-rw-r--r-- 1 cassandra cassandra 455M Dec 18 14:15 /var/lib/cassandra/data/db/table/db-table-ka-210153-Data.db
-rw-r--r-- 1 cassandra cassandra 4.4G Dec 27 09:27 /var/lib/cassandra/data/db/table/db-table-ka-210477-Data.db
-rw-r--r-- 1 cassandra cassandra 3.9G Jan  7 14:50 /var/lib/cassandra/data/db/table/db-table-ka-210834-Data.db
-rw-r--r-- 1 cassandra cassandra 4.1G Jan 16 04:47 /var/lib/cassandra/data/db/table/db-table-ka-211203-Data.db
-rw-r--r-- 1 cassandra cassandra 997M Jan 19 08:56 /var/lib/cassandra/data/db/table/db-table-ka-211292-Data.db
-rw-r--r-- 1 cassandra cassandra 1.1G Jan 20 23:24 /var/lib/cassandra/data/db/table/db-table-ka-211389-Data.db
-rw-r--r-- 1 cassandra cassandra 1.1G Jan 22 00:03 /var/lib/cassandra/data/db/table/db-table-ka-211479-Data.db
-rw-r--r-- 1 cassandra cassandra 285M Jan 22 11:33 /var/lib/cassandra/data/db/table/db-table-ka-211495-Data.db
-rw-r--r-- 1 cassandra cassandra  51M Jan 22 12:16 /var/lib/cassandra/data/db/table/db-table-ka-211500-Data.db
-rw-r--r-- 1 cassandra cassandra  16M Jan 22 12:31 /var/lib/cassandra/data/db/table/db-table-ka-211501-Data.db
-rw-r--r-- 1 cassandra cassandra  16M Jan 22 12:46 /var/lib/cassandra/data/db/table/db-table-ka-211502-Data.db

Table is user to store some metrics as time series. Every row inserted is set with TTL of one month.

  1. First thing I noticed here is existance of sstable much older than TTL, possibly because of not optimal compaction strategy for time series data. Is it safe to alter the table and change compaction strategy to DTCS now? What compaction subproperties should be best regarding TTL of one month?

  2. Since all data is inserted with the same TTL, I guess it's better to set default TTL on table. According to Datastax:

    Default TTL on a table is another natural fit
    for time-series data. It allows you to avoid costly compactions that
    run solely for the purpose of dropping tombstone and/or TTLed cells.
    Actually, it allows the compaction logic to be much more efficient,
    simply dropping the sstable when the newest cell timestamps are old
    enough to indicate all of the sstable is past its shelf life. This
    data is kept in the table metadata when it is written, making it easy
    to check. When default TTL is used in conjunction with a time-series
    workload, you can further reduce the compaction load associated with
    data density. When used in conjunction with DTCS, you can almost
    eliminate the geometric loads associated with high density on the
    compaction side of the system.

    Is it safe to do it now?

  3. Direction of clustering order is not optimal for getting last value.

    select * from table where sid='XXX' limit 1

    gives the first entry, not last.
    Is it possible to change clustering order to DESC, at this moment?

Best Answer

You can't, because that would require rewriting all your data on disk in a different order, while resorting at runtime until the rewrite was done, which would result in an unacceptable hit to performance. You'll need to create a new table and bulk-load it in.

Related Question