PostgreSQL – Numerical data with lots of NULLs: 6NF or document database?

Tags: architecture, database-design, design-pattern, document-oriented, postgresql

I have a source of numerical data, mostly comprising sensor readings, physical quantities, and flags, each associated with a datetime:

|  id  |        timetag           |   sensor1   |      sensor2     |  flag1  |  ...  |
--------------------------------------------------------------------------------------
| 943  | 2016-12-08 10:27:00.000  |  1.2323523  |  -23123.5346234  |    0    |  ...  |
| 944  | 2016-12-08 10:27:00.038  |     NULL    |     163.6151345  |    1    |  ...  |
| 945  | 2016-12-08 10:27:02.000  |  1.2477801  |       NULL       |    3    |  ...  |
                                       ...

The data has the following characteristics:

  • There are hundreds of fields, and new unexpected fields may pop up at any time. We need to record those too.
  • For every row, only a small subset (<10%) of the fields are available.
  • Sometimes a field is available but corrupted, i.e. NaN, and we need to record that.
  • Data is written once, updated almost never and read relatively often.
  • We need to do range queries on the data, both on the date and on the actual numbers.
  • We don't want to lose any data.
  • It doesn't need to be a distributed system.

I'm wondering what the best way to store these data would be.

Solution 1: One big table

A single huge table would be a bad idea: it would be mostly NULLs and unwieldy. Moreover, it would be impossible to distinguish between "recorded but NaN" NULLs and "not recorded" NULLs.

Solution 2: 6th normal form

The first solution that comes to mind is to use 6NF, creating one table for each field.

This is clean in theory, but

  1. queries would require many simultaneous JOINs;
  2. new tables would have to be created on the fly whenever a new field arrives for the first time;
  3. fields that are always paired, such as longitude and latitude, would have to be detected automatically and their tables merged for efficiency;
  4. if at some point one of the columns in a "merged" table (e.g. lon in the lon-lat example above) started to be obtained independently (without lat), keeping the table in 6NF would require detecting that and automatically splitting the table back into separate tables.

The above makes for a rather complex solution.
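For concreteness, here is a minimal sketch of the 6NF approach (the table names are illustrative, not part of any existing schema):

CREATE TABLE sensor1
    (timetag  timestamp primary key,
     value    float not null) ;

CREATE TABLE flag1
    (timetag  timestamp primary key,
     value    integer not null) ;

-- Reassembling rows takes one join per requested field:
SELECT s.timetag, s.value AS sensor1, f.value AS flag1
FROM sensor1 s
FULL JOIN flag1 f USING (timetag) ;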

Solution 3: Document database

I've been considering PostgreSQL with JSONB to mimic a document database.
This would allow us to store only the data we actually have for each row, while still keeping a single table.

However, as detailed in my (unanswered) question on Stack Overflow, JSONB doesn't seem appropriate for the numeric operations we need, like range queries.
On the other hand, NoSQL technologies like MongoDB, in which I have no experience, don't guarantee the consistency we are looking for.
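To illustrate the concern, here is a rough sketch of the JSONB approach (names are illustrative). Numeric range queries force a cast out of the JSON value, and since a GIN index on the column only supports containment-style operators (@>), each range-queried field needs its own expression index:

CREATE TABLE readings
    (id      serial primary key,
     timetag timestamp not null,
     data    jsonb not null) ;

INSERT INTO readings (timetag, data)
VALUES ('2016-12-08 10:27:00', '{"sensor1": 1.2323523, "flag1": 0}') ;

-- One expression index per field we want to range-query:
CREATE INDEX ON readings (((data->>'sensor1')::float)) ;

SELECT timetag, (data->>'sensor1')::float AS sensor1
FROM readings
WHERE (data->>'sensor1')::float BETWEEN 1.0 AND 2.0 ;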

Question

Of the solutions above, which would you favor? Are there any other options I'm missing?

Best Answer

JSONB might be a good option. If the TYPES of your flags and sensors are always the same (let's say they are all floats and ints), you could also use another strategy:

CREATE TABLE tb
    (id           serial primary key,
     timetag      timestamp default now(),  -- when the reading arrived
     sensor_nr    integer,                  -- which sensor this row belongs to
     sensor_value float,                    -- the reading; NULL means "recorded but NaN"
     flag_nr      integer,                  -- which flag this row belongs to
     flag_value   integer                   -- the flag's value
    ) ;

If you have real values for sensors 2 and 1000, you would then do:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value) 
VALUES 
    (2, 123.456), 
    (1000, 0.123) ;

Or, if there are flags:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value, flag_nr, flag_value)
VALUES
    (1000, 123.456, 1000, 0),
    (2, 234.567, 2, 1) ;

A NaN value would be represented by NULL, and "no reading" by the absence of a row for that timetag/sensor_nr combination.
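For example, using the tb schema above, a corrupted reading is stored explicitly while a silent sensor simply gets no row:

-- Sensor 2 reported, but the value was NaN:
INSERT INTO tb (sensor_nr, sensor_value) VALUES (2, NULL) ;
-- Sensor 3 did not report at all: no row is inserted for it.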

If you need frequent range SELECTs of different kinds, you would index by timetag, and also by sensor_nr and sensor_value. The insert cost would be relatively high, but the SELECTs could be fast.
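A plausible set of indexes for those access patterns (a sketch; the best composite order depends on your actual queries):

CREATE INDEX ON tb (timetag) ;
CREATE INDEX ON tb (sensor_nr, sensor_value) ;

-- e.g. readings of sensor 2 in a value range, within a time window:
SELECT timetag, sensor_value
FROM tb
WHERE sensor_nr = 2
  AND sensor_value BETWEEN 100 AND 200
  AND timetag >= '2016-12-08' AND timetag < '2016-12-09' ;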

If the "new unexpected fields that pop at any type" are of different types (i.e.: you have some co-ordinate pairs (float,float) and not just simple floats) this approach won't be flexible enough. In that occasion, probably JSON(B) and the new indices is probably your best alternative; at the cost of losing (some) type safety.