MySQL – Store time-series data in a normal SQL database

MySQL, normalization, postgresql, time-series-database, timestamp

I have historic data from around 20 machine sensors, recorded with a time resolution of one second and stored in CSV files, which needs to be imported into a SQL database prior to further data handling and analysis.

A representative mockup of the data to import looks like this:

+---------------------+------------------+------------------+------------------+-----+-------------------+
| timestamp           | name_of_sensor_0 | name_of_sensor_1 | name_of_sensor_2 | ... | name_of_sensor_19 |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:20 | 10               | 11               | 12               | ... | 19                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:21 | 20               | 21               | 22               | ... | 29                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:22 | 30               | 31               | 32               | ... | 39                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:23 | 40               | 41               | 42               | ... | 49                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:24 | 50               | 51               | 52               | ... | 59                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:25 | 60               | 61               | 62               | ... | 60                |
+---------------------+------------------+------------------+------------------+-----+-------------------+

For each sensor there are some descriptive meta data which should also be available in the database since these might contain important information needed to gain insights during further data analysis. Each sensor has the following meta data:

- designator
- acquisition_channel
- system_code
- unit_code
- sub_system_code
- function_code
- major_counting_number
- minor_counting_number
- measurement_unit
- description

In order to combine the readings and the meta data, I thought about using two SQL tables. One containing all the meta data as given above:

SensorTable
- id
- designator
- acquisition_channel
- unit_code
- system_code
- sub_system_code
- function_code
- major_counting_number
- minor_counting_number
- measurement_unit
- description

And another table containing the readings of each sensor at a given timestamp. To be able to combine the data from both tables, I could perform a JOIN on the sensor_id which is a foreign key:

ReadingsTable
- id
- timestamp
- sensor_id
- value

Both table definitions are implemented using `sqlalchemy` like this:

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


# Meta data of a single sensor.
class Sensor(Base):
    __tablename__ = 'sensors'

    id = Column(Integer, primary_key=True, autoincrement=True)
    designator = Column(String, unique=True, nullable=False)
    acquisition_channel = Column(String)
    unit_code = Column(String)
    system_code = Column(String)
    sub_system_code = Column(String)
    function_code = Column(String)
    major_counting_number = Column(String)
    minor_counting_number = Column(String)
    measurement_unit = Column(String, nullable=False)
    description = Column(String)
    readings = relationship('Reading')


# One reading of one sensor at a given timestamp.
class Reading(Base):
    __tablename__ = 'readings'

    id = Column(Integer, primary_key=True, autoincrement=True)
    timestamp = Column(DateTime, nullable=False)
    sensor_id = Column(Integer, ForeignKey('sensors.id'), nullable=False)
    value = Column(Float, nullable=False)
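
For completeness, this is roughly how I create the schema and insert a first row (a sketch; the connection string and sample values are just placeholders):

from datetime import datetime

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Placeholder connection string -- adjust driver, credentials and database name.
engine = create_engine('mysql+pymysql://user:password@localhost/sensordata')
Base.metadata.create_all(engine)  # creates the 'sensors' and 'readings' tables

Session = sessionmaker(bind=engine)
session = Session()

sensor = Sensor(designator='name_of_sensor_0', measurement_unit='bar')
session.add(sensor)
session.flush()  # assigns sensor.id

session.add(Reading(timestamp=datetime(2019, 12, 25, 5, 35, 20),
                    sensor_id=sensor.id, value=10.0))
session.commit()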

This table design looked pretty obvious to me and should fulfill the very basic principles of normalization. However, after having a look at the resulting table rows, I am wondering if I need to (or can) normalize the timestamp column any further. Every row in the ReadingsTable contains the reading of one sensor at a given timestamp. Since all sensors measure at exactly the same time, I get a lot of duplicate timestamps in that column. Recalling my data mockup from above, an excerpt of the ReadingsTable would look as follows:

+-----+---------------------+-----------+-------+
| id  | timestamp           | sensor_id | value |
+-----+---------------------+-----------+-------+
| 60  | 2019-12-25 05:35:22 | 1         | 30    |
+-----+---------------------+-----------+-------+
| 61  | 2019-12-25 05:35:22 | 2         | 31    |
+-----+---------------------+-----------+-------+
| 62  | 2019-12-25 05:35:22 | 3         | 32    |
+-----+---------------------+-----------+-------+
| ... | ...                 | ...       | ...   |
+-----+---------------------+-----------+-------+
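
For context, this is roughly how I intend to reshape the wide CSV into these long-format rows (a sketch; the file name and the sensor-name-to-id mapping are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/sensordata')

# Read the wide CSV (one column per sensor); the file name is a placeholder.
df = pd.read_csv('sensor_data.csv', parse_dates=['timestamp'])

# Reshape from wide to long: one row per (timestamp, sensor, value).
long_df = df.melt(id_vars='timestamp', var_name='sensor', value_name='value')

# Map sensor names to the ids stored in the sensors table (mapping assumed).
sensor_ids = {'name_of_sensor_0': 1, 'name_of_sensor_1': 2}
long_df['sensor_id'] = long_df['sensor'].map(sensor_ids)

# Bulk-insert into the readings table.
long_df[['timestamp', 'sensor_id', 'value']].to_sql(
    'readings', engine, if_exists='append', index=False)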

Do I need to normalize the timestamp column any further due to the duplicate entries for each timestamp? How could I do this? Which adaptations should I make to my database / table design?

I had a look at this answer, which suggests a very similar approach but still does not address the duplicates in the timestamp column.

Best Answer

Do NOT normalize any "continuous" or "numeric" values, such as a timestamp.

First of all, TIMESTAMP and DATETIME each take 5 bytes. INT takes 4 bytes. So, that's not much savings. A 3-byte MEDIUMINT UNSIGNED (0..16M) is still not enough savings to make up for the following...

More importantly, if doing a range scan, having the "range" in a "dimension" table will make the query very inefficient.

Readings.id may not be necessary -- The sensor_id plus the timestamp is unique; make the combination the PRIMARY KEY for Reading. Meanwhile, sensor_id should be as small as practical. TINYINT UNSIGNED allows 256 sensors; is that enough? (Note: It will be repeated in readings a huge number of times.)
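
In your SQLAlchemy model that could look roughly like this (a sketch; TINYINT comes from the MySQL dialect, and sensors.id should be shrunk to the same type so MySQL accepts the foreign key):

from sqlalchemy import Column, DateTime, Float, ForeignKey
from sqlalchemy.dialects.mysql import TINYINT

# Reuses the Base from the question; Sensor.id should also become
# TINYINT(unsigned=True), since MySQL wants FK columns to match exactly.
class Reading(Base):
    __tablename__ = 'readings'

    # Composite PRIMARY KEY (sensor_id, timestamp) -- no surrogate id needed.
    sensor_id = Column(TINYINT(unsigned=True), ForeignKey('sensors.id'),
                       primary_key=True)
    timestamp = Column(DateTime, primary_key=True)
    value = Column(Float, nullable=False)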

PS to Akina: When fractional seconds were added, TIMESTAMP went from 4 bytes to 5 or more; the "more" applies only if you store fractions. DATETIME went from a packed-decimal representation of 8 bytes to only 5 (or more).

Indexes

Sensors will have its id and any other indexes you will regularly use. Since it is a tiny table, the choices are not important.

The indexes for Readings are very important because of the size of the table and the likelihood of big queries. You can't really decide until you have the SELECTs sketched out. But I can guess:

PRIMARY KEY(sensor_id, timestamp)
INDEX(timestamp)

The PK allows the efficient gathering of info on a single sensor. And it is adequately efficient for the 20 inserts per second that you appear to need.

I don't know if the secondary index will be of any use.
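
To make that concrete, these are the two kinds of SELECTs I have in mind, sketched with your SQLAlchemy models (connection string is a placeholder; Reading is the model from the question):

from datetime import datetime

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql+pymysql://user:password@localhost/sensordata')
session = sessionmaker(bind=engine)()

# One sensor over a time range -- served by the PK (sensor_id, timestamp).
rows = (session.query(Reading)
        .filter(Reading.sensor_id == 3,
                Reading.timestamp.between(datetime(2019, 12, 25),
                                          datetime(2019, 12, 26)))
        .all())

# All sensors at a single instant -- this is where INDEX(timestamp) could help.
snapshot = (session.query(Reading)
            .filter(Reading.timestamp == datetime(2019, 12, 25, 5, 35, 22))
            .all())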

PARTITIONing

The only likely use for partitioning is if you intend to delete "old" data. We can discuss PARTITION BY RANGE for timeseries. Otherwise, don't use PARTITIONs.
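
If you do go the delete-old-data route, a rough sketch of monthly partitions (names and boundaries are only examples; the partitioning column has to be part of the PRIMARY KEY, which timestamp is in the scheme above):

from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@localhost/sensordata')

with engine.begin() as conn:
    conn.execute(text("""
        ALTER TABLE readings
        PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
            PARTITION p2020_01 VALUES LESS THAN (TO_DAYS('2020-02-01')),
            PARTITION p2020_02 VALUES LESS THAN (TO_DAYS('2020-03-01')),
            PARTITION pfuture  VALUES LESS THAN MAXVALUE
        )
    """))
    # Dropping a whole month later is far cheaper than a big DELETE:
    # conn.execute(text("ALTER TABLE readings DROP PARTITION p2020_01"))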

Summary Table(s)

When graphing a month's worth of data to the second, no matter how the data is stored and indexed, it will take a lot of time to fetch a few million rows, only to throw away most of the info when building the graph. We can discuss Summary Tables further.
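
As a sketch of the idea -- a per-minute rollup that the graphing queries read instead of the raw table (table and column names are only examples):

from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@localhost/sensordata')

with engine.begin() as conn:
    # One row per sensor per minute instead of per second (60x fewer rows).
    conn.execute(text("""
        CREATE TABLE readings_minute (
            sensor_id    TINYINT UNSIGNED NOT NULL,
            minute_start DATETIME NOT NULL,
            avg_value    FLOAT NOT NULL,
            min_value    FLOAT NOT NULL,
            max_value    FLOAT NOT NULL,
            PRIMARY KEY (sensor_id, minute_start)
        )
    """))
    # Fold the raw readings into the summary; re-run periodically for new data.
    conn.execute(text("""
        INSERT INTO readings_minute
        SELECT sensor_id,
               `timestamp` - INTERVAL SECOND(`timestamp`) SECOND AS minute_start,
               AVG(value), MIN(value), MAX(value)
        FROM readings
        GROUP BY sensor_id, minute_start
    """))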

References (for partitioning, summary tables, etc): http://mysql.rjweb.org/