Cassandra goes down after some days

cassandra

I am running Cassandra in a 3-node cluster with a replication factor of 3. It worked fine for 2 days and then suddenly went down. I am a beginner with the Cassandra database. I went through the documentation to understand why this is happening, but I could not figure it out. Below are the logs.

INFO  [main] 2020-02-17 11:52:34,270 CassandraDaemon.java:473 - Hostname: QCS7
INFO  [main] 2020-02-17 11:52:34,271 CassandraDaemon.java:480 - JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.8.0_152
INFO  [main] 2020-02-17 11:52:34,274 CassandraDaemon.java:481 - Heap size: 998.438MiB/998.438MiB
INFO  [main] 2020-02-17 11:52:34,275 CassandraDaemon.java:486 - Code Cache Non-heap memory: init = 2555904(2496K) used = 4815808(4702K) committed = 4849664(4736K) max = 251658240(245760K)
INFO  [main] 2020-02-17 11:52:34,275 CassandraDaemon.java:486 - Metaspace Non-heap memory: init = 0(0K) used = 18448024(18015K) committed = 19005440(18560K) max = -1(-1K)
INFO  [main] 2020-02-17 11:52:34,275 CassandraDaemon.java:486 - Compressed Class Space Non-heap memory: init = 0(0K) used = 2277864(2224K) committed = 2490368(2432K) max = 1073741824(1048576K)
INFO  [main] 2020-02-17 11:52:34,275 CassandraDaemon.java:486 - Par Eden Space Heap memory: init = 214827008(209792K) used = 98821656(96505K) committed = 214827008(209792K) max = 214827008(209792K)
INFO  [main] 2020-02-17 11:52:34,276 CassandraDaemon.java:486 - Par Survivor Space Heap memory: init = 26804224(26176K) used = 0(0K) committed = 26804224(26176K) max = 26804224(26176K)
INFO  [main] 2020-02-17 11:52:34,276 CassandraDaemon.java:486 - CMS Old Gen Heap memory: init = 805306368(786432K) used = 0(0K) committed = 805306368(786432K) max = 805306368(786432K)

INFO  [pool-3-thread-1] 2020-02-17 11:53:24,320 AutoSavingCache.java:262 - Harmless error reading saved cache /var/lib/cassandra/saved_caches/KeyCache-e.db
java.io.IOException: Corrupted key cache. Key length of 83886081 is longer than maximum of 65535
    at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:481) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:219) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cache.AutoSavingCache$3.call(AutoSavingCache.java:164) [apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cache.AutoSavingCache$3.call(AutoSavingCache.java:160) [apache-cassandra-3.11.2.jar:3.11.2]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_152]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_152]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_152]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_152]
INFO  [pool-3-thread-1] 2020-02-17 11:53:24,320 AutoSavingCache.java:173 - Completed loading (21 ms; 560 keys) KeyCache cache

INFO  [main] 2020-02-17 11:55:35,746 MessagingService.java:753 - Starting Messaging Service on /172.30.55.233:7000 (br-provider)
INFO  [main] 2020-02-17 11:55:35,827 OutboundTcpConnection.java:108 - OutboundTcpConnection using coalescing strategy DISABLED
WARN  [main] 2020-02-17 11:56:06,859 Gossiper.java:1425 - Unable to gossip with any seeds but continuing since node is in its own seed list
INFO  [main] 2020-02-17 11:56:06,863 StorageService.java:707 - Loading persisted ring state
INFO  [main] 2020-02-17 11:56:06,869 StorageService.java:825 - Starting up server gossip
INFO  [main] 2020-02-17 11:56:07,546 TokenMetadata.java:479 - Updating topology for /172.30.55.233
INFO  [main] 2020-02-17 11:56:07,546 TokenMetadata.java:479 - Updating topology for /172.30.55.233

Best Answer

This seems to be an issue with the values of column_index_cache_size_in_kb and key_cache_size_in_mb in your cassandra.yaml file. As explained here, the database will stop when the total size of all the index data for a partition exceeds the value defined in column_index_cache_size_in_kb.
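For reference, here is a minimal sketch of where those two settings live in cassandra.yaml. The values shown are the stock 3.11 defaults, not recommendations for your cluster; tune them against your actual partition sizes and heap:

    # cassandra.yaml (example values only)
    # Threshold on the total size of a partition's index entries
    # that is kept in the key cache.
    column_index_cache_size_in_kb: 2

    # Maximum key cache size in memory. Leaving it empty means
    # "auto": min(5% of the heap, 100 MiB).
    key_cache_size_in_mb:

After changing either value, restart the node and watch the key cache hit rate (e.g. via nodetool info) to confirm the new sizing behaves as expected.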

If tuning these values doesn't fix the issue, you may want to consider using larger instances for the nodes.
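Your log shows a heap of only ~1 GiB, so if you do move to larger machines, the JVM heap is usually raised along with them. A hedged example in conf/cassandra-env.sh, assuming a node with around 16 GB of RAM (by default both values are computed automatically from the machine's memory, so only set them if you intend to override that):

    # cassandra-env.sh (illustrative values; set both together)
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="2G"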