Database – Which Can Handle Billions/Trillions of Records?

Tags: cassandra, mongodb, nosql, rdbms, sql-server

We are looking at developing a tool to capture and analyze netflow data, of which we gather tremendous amounts. Each day we capture roughly 1.4 billion flow records, each of which looks like this in JSON format:

{
   "tcp_flags": "0",
   "src_as": "54321",
   "nexthop": "1.2.3.4",
   "unix_secs": "1352234521",
   "src_mask": "23",
   "tos": "0",
   "prot": "6",
   "input": "105",
   "doctets": "186",
   "engine_type": "0",
   "exaddr": "2.3.4.5",
   "engine_id": "2",
   "srcaddr": "9.8.7.6",
   "dst_as": "12345",
   "unix_nsecs": "752265174",
   "sysuptime": "2943529544",
   "dst_mask": "24",
   "dstport": "80",
   "last": "2943523241",
   "srcport": "52672",
   "dpkts": "4",
   "output": "111",
   "dstaddr": "6.5.4.3",
   "first": "2943517993"
}

We would like to be able to run fast searches (under 10 seconds) on the data set, most likely over narrow slices of time (10–30 minute intervals). We also want to index the majority of the data points so we can search on each of them quickly, and we would like an up-to-date view of the data when searches are executed. It would be great to stay in the open-source world, but we are not opposed to looking at proprietary solutions for this project.

The idea is to keep approximately one month of data, which would be ~43.2 billion records. Assuming each record contains about 480 bytes, that equates to ~18.7 terabytes of data in a month, and perhaps three times that with indexes. Eventually we would like to grow the capacity of this system to store trillions of records.
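The capacity figures above can be sanity-checked with a quick back-of-envelope calculation (note that ~18.7 terabytes corresponds to the raw size expressed in binary terabytes, TiB):

```python
# Back-of-envelope capacity check for the figures in the question.
records_per_day = 1.44e9      # ~1.4 billion flow records per day
days_retained = 30            # keep approximately one month of data
record_bytes = 480            # rough per-record size estimate

total_records = records_per_day * days_retained   # ~43.2 billion
raw_bytes = total_records * record_bytes

tb = raw_bytes / 1e12         # decimal terabytes
tib = raw_bytes / 2**40       # binary terabytes (TiB)

print(f"records:  {total_records:.2e}")
print(f"raw size: {tb:.1f} TB ({tib:.1f} TiB)")
print(f"with ~3x index overhead: {3 * tb:.1f} TB")
```

This yields roughly 20.7 TB (about 18.9 TiB) of raw data per month, tripling to ~62 TB once index overhead is included.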

We have (very basically) evaluated Couchbase, Cassandra, and MongoDB so far as possible candidates for this project, but each presents its own challenges. With Couchbase, indexing is done at intervals rather than during insertion, so the views are not up to date; Cassandra's secondary indexes are not very efficient at returning results, as they typically require scanning the entire cluster; and MongoDB looks promising but appears far more difficult to scale, given its master/slave/sharded architecture. Other candidates we plan to evaluate are Elasticsearch, MySQL (not sure whether it is even applicable here), and a few column-oriented relational databases. Any suggestions or real-world experience would be appreciated.

Best Answer

At the company I work for, we deal with a similar amount of data (around 10 TB of real-time searchable data). We solve this with Cassandra, and I would like to mention a couple of ideas that will allow you to do an O(1) search on a multi-terabyte database. This is not specific to Cassandra, though; you can use these ideas with other databases as well.

Theory

  • Shard your data. There is no way a single server will reliably and realistically hold such a volume of data.
  • Be ready for hardware faults and whole-node failures; duplicate the data.
  • Start using many back-end servers from the beginning.
  • Use many cheaper commodity servers rather than a few top-end high-performance ones.
  • Make sure data is equally distributed across shards.
  • Spend a lot of time planning your queries. Derive the API from the queries, and then carefully design the tables. This is the most important and most prolonged task.
  • In Cassandra, you can design a composite column key and get access to that key in O(1). Spend time working on these keys. They will be used to access searchable records instead of a secondary index.
  • Make use of wide rows. They are useful for storing time-stamped events.
  • Never perform a full scan, or in fact any operation costlier than O(log N), on such a volume. If you require anything more than O(log N), offload such operations to map-reduce algorithms.
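To make the composite-key and wide-row ideas concrete, here is a hypothetical sketch in Python (the 10-minute bucket size and function names are assumptions for illustration, not from the answer): each flow is assigned to a partition keyed by (source address, time bucket), so a search over a narrow time slice resolves to a handful of partitions read directly by key, rather than a cluster-wide secondary-index scan.

```python
# Hypothetical time-bucketed partitioning sketch; the bucket size
# and key layout are illustrative assumptions, not a fixed schema.
BUCKET_SECONDS = 600  # 10-minute wide rows, matching the query window


def partition_key(srcaddr: str, unix_secs: int) -> tuple:
    """Composite key: (srcaddr, time bucket).

    A lookup by this key is O(1) -- it goes straight to the owning
    partition instead of triggering a cluster-wide index scan.
    """
    return (srcaddr, unix_secs // BUCKET_SECONDS)


def buckets_for_range(start_secs: int, end_secs: int) -> list:
    """Enumerate the bucket ids covering a query's time slice, so a
    30-minute search fans out to only a few partition reads."""
    first = start_secs // BUCKET_SECONDS
    last = end_secs // BUCKET_SECONDS
    return list(range(first, last + 1))


# A 30-minute slice spans at most four 10-minute buckets.
print(partition_key("9.8.7.6", 1352234521))
print(buckets_for_range(1352234521, 1352234521 + 1800))
```

The same pattern maps onto a Cassandra table whose partition key is (srcaddr, bucket) with the flow timestamp as a clustering column, which is exactly the wide-row, time-stamped-event layout the bullets above describe.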

Practice

  • Don't spend time building OS images or installing servers on physical machines. Use cloud-based providers for quick prototyping. I have worked with Amazon EC2 and can highly recommend it for its simplicity, reliability, and speed of prototyping.
  • Windows machines tend to be slower to boot and consume considerably more resources when idle. Consider using a Unix-based OS. Personally, I have found Ubuntu Server to be a reliable OS, and moreover there is a pretty good community at Ask Ubuntu.
  • Think about networking: nodes should ideally be close to each other to allow fast gossiping and metadata exchange.
  • Do not go to extremes: really wide rows or exceptionally long column families (tables). Best performance is achieved within sane boundaries; just because the database supports N rows by design does not mean it performs well at that scale.
  • Our search takes about 3–5 seconds; much of that is due to the intermediate nodes between the UI and the database. Consider how to bring requests closer to the database.
  • Use a network load balancer, and choose an established one. We use HAProxy, which is simple but extremely fast. We have never had problems with it.
  • Prefer simplicity to complex solutions.
  • Look for free open-source solutions unless you are backed by a corporation-sized budget. Once you go beyond several servers, infrastructure costs can go sky-high.
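As a rough illustration of the load-balancer point, a minimal HAProxy TCP front-end for a Cassandra cluster might look like the following (the node addresses and names are placeholders; port 9042 is Cassandra's native CQL port):

```
frontend cassandra_in
    mode tcp
    bind *:9042
    default_backend cassandra_nodes

backend cassandra_nodes
    mode tcp
    balance roundrobin
    server node1 10.0.0.11:9042 check
    server node2 10.0.0.12:9042 check
    server node3 10.0.0.13:9042 check
```

In practice a Cassandra driver can also do client-side, token-aware routing, so whether to put a balancer in front depends on your client topology; the config above is just a sketch of the simple round-robin setup the bullet describes.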

I do not work for Amazon and have no relationship with the HAProxy or Ubuntu teams. This is a personal opinion rather than any sort of promotion.