Database – Which Can Handle Billions/Trillions of Records?

Tags: cassandra, mongodb, nosql, rdbms, sql-server

We are looking at developing a tool to capture and analyze netflow data, of which we gather tremendous amounts. Each day we capture roughly 1.4 billion flow records, each of which looks like this in JSON format:

{
   "tcp_flags": "0",
   "src_as": "54321",
   "nexthop": "1.2.3.4",
   "unix_secs": "1352234521",
   "src_mask": "23",
   "tos": "0",
   "prot": "6",
   "input": "105",
   "doctets": "186",
   "engine_type": "0",
   "exaddr": "2.3.4.5",
   "engine_id": "2",
   "srcaddr": "9.8.7.6",
   "dst_as": "12345",
   "unix_nsecs": "752265174",
   "sysuptime": "2943529544",
   "dst_mask": "24",
   "dstport": "80",
   "last": "2943523241",
   "srcport": "52672",
   "dpkts": "4",
   "output": "111",
   "dstaddr": "6.5.4.3",
   "first": "2943517993"
}

We would like to be able to run fast searches (under 10 seconds) on the data set, most likely over narrow slices of time (10–30 minute intervals). We also want to index the majority of the data points so we can search on each of them quickly, and we would like an up-to-date view of the data when searches are executed. It would be great to stay in the open-source world, but we are not opposed to looking at proprietary solutions for this project.

The idea is to keep approximately one month of data, which would be ~43.2 billion records. Assuming each record contains about 480 bytes, that equates to ~18.7 terabytes of data in a month, and perhaps three times that with indexes. Eventually we would like to grow the capacity of this system to store trillions of records.
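The capacity figures above can be sanity-checked with a quick back-of-envelope calculation (note that ~18.7 terabytes corresponds to the raw size expressed in binary terabytes, TiB):

```python
# Back-of-envelope capacity check for the figures in the question.
records_per_day = 1.44e9      # ~1.4 billion flow records per day
days_retained = 30            # keep approximately one month of data
record_bytes = 480            # rough per-record size estimate

total_records = records_per_day * days_retained   # ~43.2 billion
raw_bytes = total_records * record_bytes

tb = raw_bytes / 1e12         # decimal terabytes
tib = raw_bytes / 2**40       # binary terabytes (TiB)

print(f"records:  {total_records:.2e}")
print(f"raw size: {tb:.1f} TB ({tib:.1f} TiB)")
print(f"with ~3x index overhead: {3 * tb:.1f} TB")
```

This yields roughly 20.7 TB (about 18.9 TiB) of raw data per month, tripling to ~62 TB once index overhead is included.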

We have (very basically) evaluated Couchbase, Cassandra, and MongoDB so far as possible candidates for this project, but each presents its own challenges. With Couchbase, indexing is done at intervals rather than during insertion, so the views are not up to date; Cassandra's secondary indexes are not very efficient at returning results, as they typically require scanning the entire cluster; and MongoDB looks promising but appears far more difficult to scale, given its master/slave/sharded architecture. Other candidates we plan to evaluate are Elasticsearch, MySQL (not sure whether it is even applicable here), and a few column-oriented relational databases. Any suggestions or real-world experience would be appreciated.

Best Answer

At the company I work for, we deal with a similar amount of data (around 10 TB of real-time searchable data). We solve this with Cassandra, and I would like to mention a couple of ideas that will allow you to do an O(1) search on a multi-terabyte database. This is not specific to Cassandra, though; you can use these ideas with other databases as well.

Theory

  • Shard your data. There is no way a single server will reliably and realistically hold such a volume of data.
  • Be ready for hardware faults and whole-node failures; duplicate the data.
  • Start using many back-end servers from the beginning.
  • Use many cheaper commodity servers rather than a few top-end high-performance ones.
  • Make sure data is equally distributed across shards.
  • Spend a lot of time planning your queries. Derive the API from the queries, and then carefully design the tables. This is the most important and most prolonged task.
  • In Cassandra, you can design a composite column key and get access to that key in O(1). Spend time working on these keys. They will be used to access searchable records instead of a secondary index.
  • Make use of wide rows. They are useful for storing time-stamped events.
  • Never perform a full scan, or in fact any operation costlier than O(log N), on such a volume. If you require anything more than O(log N), offload such operations to map-reduce algorithms.
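To make the composite-key and wide-row ideas concrete, here is a hypothetical sketch in Python (the 10-minute bucket size and function names are assumptions for illustration, not from the answer): each flow is assigned to a partition keyed by (source address, time bucket), so a search over a narrow time slice resolves to a handful of partitions read directly by key, rather than a cluster-wide secondary-index scan.

```python
# Hypothetical time-bucketed partitioning sketch; the bucket size
# and key layout are illustrative assumptions, not a fixed schema.
BUCKET_SECONDS = 600  # 10-minute wide rows, matching the query window


def partition_key(srcaddr: str, unix_secs: int) -> tuple:
    """Composite key: (srcaddr, time bucket).

    A lookup by this key is O(1) -- it goes straight to the owning
    partition instead of triggering a cluster-wide index scan.
    """
    return (srcaddr, unix_secs // BUCKET_SECONDS)


def buckets_for_range(start_secs: int, end_secs: int) -> list:
    """Enumerate the bucket ids covering a query's time slice, so a
    30-minute search fans out to only a few partition reads."""
    first = start_secs // BUCKET_SECONDS
    last = end_secs // BUCKET_SECONDS
    return list(range(first, last + 1))


# A 30-minute slice spans at most four 10-minute buckets.
print(partition_key("9.8.7.6", 1352234521))
print(buckets_for_range(1352234521, 1352234521 + 1800))
```

The same pattern maps onto a Cassandra table whose partition key is (srcaddr, bucket) with the flow timestamp as a clustering column, which is exactly the wide-row, time-stamped-event layout the bullets above describe.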

Practice

  • Don't spend time building OS images or installing servers on physical machines. Use cloud-based providers for quick prototyping. I have worked with Amazon EC2 and can highly recommend it for its simplicity, reliability, and speed of prototyping.
  • Windows machines tend to be slower to boot and consume considerably more resources when idle. Consider using a Unix-based OS. Personally, I have found Ubuntu Server to be a reliable OS, and moreover there is a pretty good community at Ask Ubuntu.
  • Think about networking: nodes should ideally be close to each other to allow fast gossiping and metadata exchange.
  • Do not go to extremes: really wide rows or exceptionally long column families (tables). Best performance is achieved within sane boundaries; just because the database supports N rows by design does not mean it performs well at that scale.
  • Our search takes about 3–5 seconds; much of that is due to the intermediate nodes between the UI and the database. Consider how to bring requests closer to the database.
  • Use a network load balancer, and choose an established one. We use HAProxy, which is simple but extremely fast. We have never had problems with it.
  • Prefer simplicity to complex solutions.
  • Look for free open-source solutions unless you are backed by a corporation-sized budget. Once you go beyond several servers, infrastructure costs can go sky-high.
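As a rough illustration of the load-balancer point, a minimal HAProxy TCP front-end for a Cassandra cluster might look like the following (the node addresses and names are placeholders; port 9042 is Cassandra's native CQL port):

```
frontend cassandra_in
    mode tcp
    bind *:9042
    default_backend cassandra_nodes

backend cassandra_nodes
    mode tcp
    balance roundrobin
    server node1 10.0.0.11:9042 check
    server node2 10.0.0.12:9042 check
    server node3 10.0.0.13:9042 check
```

In practice a Cassandra driver can also do client-side, token-aware routing, so whether to put a balancer in front depends on your client topology; the config above is just a sketch of the simple round-robin setup the bullet describes.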

I do not work for Amazon and have no relationship with the HAProxy or Ubuntu teams. This is a personal opinion rather than any sort of promotion.