Using an RDBMS for querying tens of terabytes of time-series data

database-recommendation database-size performance

I stumbled over an algorithmic / data-structures question over on SO, which I'll quote in part:

(…) opinions regarding best-of-breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).

Queries that will be required:

All values in the time range [t0,t1]

All values in the time range [t0,t1] that are greater/less than v0

All values in the time range [t0,t1] that are in the value range [v0,v1]

The data sets consist of summarized time-series (…) The data set(s) in question are about 15-20 TB in size, hence processing is performed in a distributed manner, because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.

Distributed processing in this context also means dispatching the required data-specific computation along with the time-series query, so that the computation can occur as close to the data as possible and node-to-node communication is reduced (somewhat similar to the map/reduce paradigm). In short, proximity of computation and data is very critical.
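
For concreteness, the three queries above map onto plain SQL along these lines; the table, column names, and literal values below are made up purely for illustration:

```sql
-- Hypothetical summarized time-series table (names are illustrative).
CREATE TABLE dbo.ts_summary (
    ts        DATETIME2 NOT NULL,  -- sample timestamp
    series_id INT       NOT NULL,  -- which series the value belongs to
    value     FLOAT     NOT NULL   -- summarized measurement
);

DECLARE @t0 DATETIME2 = '2010-01-01',
        @t1 DATETIME2 = '2010-02-01',
        @v0 FLOAT     = 0.0,
        @v1 FLOAT     = 100.0;

-- 1) All values in the time range [t0, t1].
SELECT value
FROM   dbo.ts_summary
WHERE  ts BETWEEN @t0 AND @t1;

-- 2) All values in [t0, t1] that are greater than v0.
SELECT value
FROM   dbo.ts_summary
WHERE  ts BETWEEN @t0 AND @t1
  AND  value > @v0;

-- 3) All values in [t0, t1] that are in the value range [v0, v1].
SELECT value
FROM   dbo.ts_summary
WHERE  ts BETWEEN @t0 AND @t1
  AND  value BETWEEN @v0 AND @v1;
```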

I'll readily admit that problems of this scale are way over my head, but my first hunch, even with the data sizes mentioned, would have been to ask whether they had checked whether a large-scale RDBMS (well, I guess Oracle, or Oracle, right?) could handle this in a sane way.

So here's the question: can an (enterprise?) RDBMS handle this kind of problem today with acceptable performance compared to a "hand-coded" solution?

Note: Hope this isn't too vague, and feel free to re-tag as appropriate 🙂

Best Answer

Answering this from one point of view: SQL Server 2012 columnstores could easily handle this. I've seen it working, and once segment elimination and batch processing kick in, those TBs are reduced to very few actual IOs and the results get churned out in a few milliseconds (i.e. most of the data is eliminated up front, so there is simply no need to scan hundreds of TBs). The queries you asked about are exactly the kind columnstores are designed for. This is such an efficient storage/processing paradigm that there's simply no need for distributed computing, even at hundreds of TB of data.
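
As a minimal sketch of what that looks like, assuming the hypothetical dbo.ts_summary table from the question above: a columnstore index stores rows column-wise in segments of roughly a million rows, and each column segment carries min/max metadata that the engine compares against your range predicates to skip segments wholesale. The index name here is made up.

```sql
-- In SQL Server 2012 a columnstore index is nonclustered and read-only
-- once built (clustered, updatable columnstores arrived in 2014).
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_ts_summary_cs
    ON dbo.ts_summary (ts, series_id, value);

-- Segment elimination: a predicate like ts BETWEEN @t0 AND @t1 is
-- checked against each segment's min/max metadata, and segments that
-- cannot match are skipped before any data IO happens. The per-segment
-- metadata can be inspected through the catalog views:
SELECT  s.column_id,
        s.segment_id,
        s.row_count,
        s.min_data_id,   -- encoded per-segment minimum
        s.max_data_id    -- encoded per-segment maximum
FROM    sys.column_store_segments AS s
JOIN    sys.partitions            AS p ON p.hobt_id = s.hobt_id
WHERE   p.object_id = OBJECT_ID(N'dbo.ts_summary');
```

If the data is bulk loaded in time order, consecutive segments end up with tight, non-overlapping ts ranges, which is what makes elimination on [t0, t1] so effective; batch-mode processing then chews through the surviving segments roughly a thousand rows at a time instead of row by row.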