Timeseries Database – SQL or NoSQL?


I don't care about the general differences between SQL and NoSQL (or their traditional differences).

I am currently looking at altering the storage of our internal time series. They all contain financial data from a number of different sources. Currently, we are storing our data in a proprietary database that is very much NoSQL, with its own query language.

I'm interested in the community's input: How would you store the data in a SQL database? What merits are there to using SQL over NoSQL, specifically for time series? Am I insane for considering storing this in SQL?

Our data set consists of millions of time series, with around 10% of these containing millions of records each.
The time series are organised hierarchically: /Market/Instrument/Value/Frequency, where:

  • Market is a securities exchange, etc.; basically a collection of instruments, usually similar ones.
  • Instrument is a single instrument. This could be an indicator (Brent Crude), an equity (GOOG), etc.
  • Value is one of multiple types of data for an instrument. This could be a close, high, low, etc.
  • Frequency is the frequency of a particular time series' values: weekly, daily, monthly, tick, arbitrary, etc.

How would the data be stored in a SQL db? One big table (maybe partitioned by something), one table per market or instrument, or one table per time series?

Thank you in advance.

Best Answer

In general, for such a structured dataset I suspect you could write a custom data format that would be faster for most daily operations (i.e. small data pulls from an arbitrary time). The benefit of moving to a standard DB tool is likely in some of the extras: ad hoc queries, concurrent access, replication, availability, etc. It's also easier to hire help to maintain a standards-based data store.

If I were asked to set up a database to store that data, I would do the following:

Proposed schema

(1) Core data is placed into numerous (thousands of) individual tables, each containing two columns:

  1. time: either a SQL DATETIME data type or a numeric type from some epoch (this is the primary key)
  2. value: typed as appropriate for your data. I would default to single-precision float, though a fixed-point data type may be more appropriate for financial transactions. This is probably unindexed.

These tables will get quite large, and you may want to manually partition them by (for example) year. But you'll have to check system performance and tune as appropriate.

These tables need unique names, and there are a couple of options. They could be human-readable (e.g. nyse_goog_dailyhighs_2010) or (my preference) random. Either way a set of metadata tables is required, and random table names prevent developers from inferring anything from a name that was not meant to be inferred.
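As a rough sketch of one such data table (all names here are hypothetical, the syntax is MySQL-flavoured, and the exact types depend on your engine and data):

    -- One physical table per time series; the generated name is recorded
    -- in the metadata tables described below.
    CREATE TABLE ts_a7f3c901 (
        time  DATETIME NOT NULL,   -- or a BIGINT offset from some epoch
        value FLOAT    NOT NULL,   -- or DECIMAL(18,6) for fixed-point data
        PRIMARY KEY (time)         -- no secondary index on value
    );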

(2) Meta data is stored in separate tables, as required by the application:

An additional table or set of tables is required to keep track of the metadata. These tables will contain data about exchange, instrument, value, frequency, date ranges, provenance (where the data came from), plus anything else you need. These are mapped to data table names.

If there is enough data, this lookup could actually provide a table name and database name, allowing a sort of self-implemented data sharding (if that is the correct use of the term). But I would hold that in reserve.
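A minimal sketch of the metadata side, again with purely illustrative names (in practice you would probably normalize market, instrument, value and frequency into their own lookup tables):

    -- Maps the logical hierarchy to a physical data table.
    CREATE TABLE series_metadata (
        series_id   INT         NOT NULL AUTO_INCREMENT,
        market      VARCHAR(32) NOT NULL,   -- e.g. 'NYSE'
        instrument  VARCHAR(32) NOT NULL,   -- e.g. 'GOOG'
        value_type  VARCHAR(32) NOT NULL,   -- e.g. 'close', 'high', 'low'
        frequency   VARCHAR(16) NOT NULL,   -- e.g. 'daily', 'tick'
        data_table  VARCHAR(64) NOT NULL,   -- e.g. 'ts_a7f3c901'
        -- plus date ranges, provenance, optionally a database/shard name, ...
        PRIMARY KEY (series_id),
        UNIQUE KEY (market, instrument, value_type, frequency)
    );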

Then at the application layer I would query the metadata tables to determine where my data was located, and then perform relatively simple queries on the big data tables to get my data.
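In SQL terms the two-step lookup might look something like this (the application has to substitute the table name into the second query, since standard SQL cannot parameterize identifiers):

    -- Step 1: find the physical table for the series of interest.
    SELECT data_table
    FROM   series_metadata
    WHERE  market = 'NYSE' AND instrument = 'GOOG'
      AND  value_type = 'close' AND frequency = 'daily';

    -- Step 2: pull a contiguous range from that table
    -- (table name filled in by the application, e.g. ts_a7f3c901).
    SELECT time, value
    FROM   ts_a7f3c901
    WHERE  time BETWEEN '2010-01-01' AND '2010-12-31'
    ORDER BY time;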

Advantages:

  • My (relatively limited) experience is that databases can generally handle a large number of small tables more easily than a smaller number of large tables. This approach also enables easier maintenance (e.g. purging old data, rebuilding a corrupt table, creating/reloading from backups, adding a new entity), and it completely decouples the different kinds of data if (for example) you have data arriving at different rates or requiring different data types.

  • This skinny-table concept should also allow fast disk access for what I suspect is the most common query: a contiguous range of data from a single entity. Most data applications are disk-I/O limited, so this is worth considering. As a commenter has already implied, this may be an ideal application for a column-oriented database, but I have yet to find a column-oriented product that is mainstream enough for me to bet my career on. This schema gets pretty close.

Disadvantages:

  • About half of your disk space is dedicated to storing timestamps, when quite frankly hundreds or thousands of the tables will have exactly the same data in the timestamp column. (In fact, this is a requirement if you want to perform easy table joins.)

  • Storing table names and performing the dynamic lookup requires a lot of application complexity and string operations, which kind of makes me cringe. But it still seems better than the alternatives (discussed below).

Considerations:

  • Be careful of rounding in your time field. You want your values round enough to enable joins (if appropriate), but precise enough to be unambiguous.

  • Be careful of time zones and daylight saving time. These are hard to test. I would enforce a UTC requirement on the data store (which may make me unpopular) and handle conversions in the application.

Variations:

Some variations that I have considered are:

Data folding: If the time series is equally spaced, then use one timestamp column and (for example) 10 data columns. The timestamp now refers to the time of the first data column, and the other data columns are assumed to be equally spaced between that timestamp and the next one. This saves a lot of the storage that was previously used for timestamps, at the cost of significant query and/or application complexity. Contiguous-range, single-entity queries now require less disk access.
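A sketch of what a folded table could look like, assuming a fixed spacing that is recorded in the metadata (the column count and names are arbitrary):

    -- Each row holds 10 consecutive, equally spaced values for one series.
    -- 'time' is the timestamp of v0; v1..v9 follow at the known interval.
    CREATE TABLE ts_folded_a7f3c901 (
        time DATETIME NOT NULL,
        v0 FLOAT, v1 FLOAT, v2 FLOAT, v3 FLOAT, v4 FLOAT,
        v5 FLOAT, v6 FLOAT, v7 FLOAT, v8 FLOAT, v9 FLOAT,
        PRIMARY KEY (time)
    );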

Multi-plexing: If multiple time series are known to share the same timestamps, then use one timestamp and (for example) 10 data columns as described above, but now each column represents a different time series. This requires an update to the metadata table, which is now a lookup into a table name and a column name. Storage space is reduced. Queries remain simple. However, contiguous-range, single-entity queries now require significantly more disk access.
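With multiplexing the physical layout is the same as the folded table above, but each column belongs to a different series, so the metadata has to record a column as well as a table (again, the names are purely illustrative):

    -- The metadata now maps a series to (table, column) rather than a table.
    CREATE TABLE series_location (
        series_id   INT         NOT NULL,   -- series_id from series_metadata
        data_table  VARCHAR(64) NOT NULL,   -- e.g. 'ts_mux_0042'
        data_column VARCHAR(16) NOT NULL,   -- e.g. 'v3'
        PRIMARY KEY (series_id)
    );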

Mega-table: Take the "multi-plexing" concept to the extreme and put all data into a single table, one time series per column. This requires large amounts of disk access for contiguous-range, single-entity queries, and is a maintenance nightmare. For example, adding a new entity now requires an ALTER TABLE command on a many-TB table.

For additional discussion on this format, see the various answers in: Too many columns in MySQL

Fully normalized table: Instead of using many two-column tables, you could use one three-column table, where the columns are time, dataid, and value. Now your metadata tables only need to look up ID values rather than table or column names, which enables pushing more logic into the SQL queries rather than into the application layer.

Approximately 2/3 of the storage is now consumed by the normalizing columns, so this will use a lot of disk space.

You can use a primary key order of (dataid, timestamp) for fast contiguous, single-entity queries. Or, you can use a primary key order of (timestamp, dataid) for faster inserts.
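A sketch of the fully normalized version, with the alternative key ordering shown as a comment (names are illustrative, and dataid would reference the metadata table):

    -- One table for all series; dataid is the series_id from the metadata.
    CREATE TABLE ts_data (
        dataid INT      NOT NULL,
        time   DATETIME NOT NULL,
        value  FLOAT    NOT NULL,
        PRIMARY KEY (dataid, time)   -- fast contiguous single-series reads
        -- alternatively: PRIMARY KEY (time, dataid) for faster appends
    );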

However, even after considering these variations, my plan for my next development is lots of tables, two columns each. That, or the method soon to be posted by someone wiser than I. :)
