PostgreSQL – Numerical data with lots of NULLs: 6NF or document database?

Tags: architecture, database-design, design-pattern, document-oriented, postgresql

I have a source of numerical data, mostly comprising sensor readings, physical quantities, and flags, each associated with a datetime:

|  id  |        timetag           |   sensor1   |      sensor2     |  flag1  |  ...  |
--------------------------------------------------------------------------------------
| 943  | 2016-12-08 10:27:00.000  |  1.2323523  |  -23123.5346234  |    0    |  ...  |
| 944  | 2016-12-08 10:27:00.038  |     NULL    |     163.6151345  |    1    |  ...  |
| 945  | 2016-12-08 10:27:02.000  |  1.2477801  |       NULL       |    3    |  ...  |
                                       ...

The data has the following characteristics:

  • There are hundreds of fields, and new unexpected fields may pop up at any time. We need to record those too.
  • For every row, only a small subset (<10%) of the fields are available.
  • Sometimes a field is available but corrupted, i.e. NaN, and we need to record that.
  • Data is written once, updated almost never and read relatively often.
  • We need to do range queries on the data, both on the date and on the actual numbers.
  • We don't want to lose any data.
  • It doesn't need to be a distributed system.

I'm wondering what the best way to store these data would be.

Solution 1: One big table

A single huge table would be a bad idea: it would be mostly NULLs and unwieldy. Moreover, it would be impossible to distinguish between "recorded but NaN" NULLs and "not recorded" NULLs.

Solution 2: 6th normal form

The first solution that comes to mind is to use 6NF, creating one table for each field.

This is clean in theory, but

  1. queries would require many simultaneous JOINs;
  2. new tables would have to be created on the fly whenever a new field arrives for the first time;
  3. fields that are always paired, such as longitude and latitude, would have to be detected automatically and their tables merged for efficiency;
  4. if at some point one of the columns in a "merged" table (e.g. lon in the lon-lat example above) started to be obtained independently (without lat), keeping the table in 6NF would require detecting that and automatically splitting the table back into separate tables.

The above makes for a rather complex solution.
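For concreteness, here is a minimal sketch of the 6NF approach (the table names are illustrative, not part of any existing schema):

CREATE TABLE sensor1
    (timetag  timestamp primary key,
     value    float not null) ;

CREATE TABLE flag1
    (timetag  timestamp primary key,
     value    integer not null) ;

-- Reassembling rows takes one join per requested field:
SELECT s.timetag, s.value AS sensor1, f.value AS flag1
FROM sensor1 s
FULL JOIN flag1 f USING (timetag) ;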

Solution 3: Document database

I've been considering PostgreSQL with JSONB to mimic a document database.
This would allow us to store only the data we actually have for each row, while still keeping a single table.

However, as detailed in my (unanswered) question on Stack Overflow, JSONB doesn't seem appropriate for the numeric operations we need, like range queries.
On the other hand, NoSQL technologies like MongoDB, in which I have no experience, don't guarantee the consistency we are looking for.
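To illustrate the concern, here is a rough sketch of the JSONB approach (names are illustrative). Numeric range queries force a cast out of the JSON value, and since a GIN index on the column only supports containment-style operators (@>), each range-queried field needs its own expression index:

CREATE TABLE readings
    (id      serial primary key,
     timetag timestamp not null,
     data    jsonb not null) ;

INSERT INTO readings (timetag, data)
VALUES ('2016-12-08 10:27:00', '{"sensor1": 1.2323523, "flag1": 0}') ;

-- One expression index per field we want to range-query:
CREATE INDEX ON readings (((data->>'sensor1')::float)) ;

SELECT timetag, (data->>'sensor1')::float AS sensor1
FROM readings
WHERE (data->>'sensor1')::float BETWEEN 1.0 AND 2.0 ;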

Question

Of the solutions above, which would you favor? Are there any other options I'm missing?

Best Answer

JSONB might be a good option. If the TYPES of your flags and sensors are always the same (let's say they are all floats and ints), you could also use another strategy:

CREATE TABLE tb
    (id           serial primary key,
     timetag      timestamp default now(),  -- when the reading arrived
     sensor_nr    integer,                  -- which sensor this row belongs to
     sensor_value float,                    -- the reading; NULL means "recorded but NaN"
     flag_nr      integer,                  -- which flag this row belongs to
     flag_value   integer                   -- the flag's value
    ) ;

If you have real values for sensors 2 and 1000, you would then do:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value) 
VALUES 
    (2, 123.456), 
    (1000, 0.123) ;

Or, if there are flags:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value, flag_nr, flag_value)
VALUES
    (1000, 123.456, 1000, 0),
    (2, 234.567, 2, 1) ;

A NaN value would be represented by NULL, and "no reading" by the absence of a row for that timetag/sensor_nr combination.
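For example, using the tb schema above, a corrupted reading is stored explicitly while a silent sensor simply gets no row:

-- Sensor 2 reported, but the value was NaN:
INSERT INTO tb (sensor_nr, sensor_value) VALUES (2, NULL) ;
-- Sensor 3 did not report at all: no row is inserted for it.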

If you need frequent range SELECTs of different kinds, you would index by timetag, and also by sensor_nr and sensor_value. The insert cost would be relatively high, but the SELECTs could be fast.
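A plausible set of indexes for those access patterns (a sketch; the best composite order depends on your actual queries):

CREATE INDEX ON tb (timetag) ;
CREATE INDEX ON tb (sensor_nr, sensor_value) ;

-- e.g. readings of sensor 2 in a value range, within a time window:
SELECT timetag, sensor_value
FROM tb
WHERE sensor_nr = 2
  AND sensor_value BETWEEN 100 AND 200
  AND timetag >= '2016-12-08' AND timetag < '2016-12-09' ;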

If the "new unexpected fields that pop at any type" are of different types (i.e.: you have some co-ordinate pairs (float,float) and not just simple floats) this approach won't be flexible enough. In that occasion, probably JSON(B) and the new indices is probably your best alternative; at the cost of losing (some) type safety.