I have historic data from around 20 machine sensors, stored in CSV files with a time resolution of one second, which needs to be imported into a SQL database prior to further data handling and analysis.
A representative mockup of the data to import looks like this:
+---------------------+------------------+------------------+------------------+-----+-------------------+
| timestamp           | name_of_sensor_0 | name_of_sensor_1 | name_of_sensor_2 | ... | name_of_sensor_19 |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:20 | 10               | 11               | 12               | ... | 19                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:21 | 20               | 21               | 22               | ... | 29                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:22 | 30               | 31               | 32               | ... | 39                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:23 | 40               | 41               | 42               | ... | 49                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:24 | 50               | 51               | 52               | ... | 59                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
| 2019-12-25 05:35:25 | 60               | 61               | 62               | ... | 69                |
+---------------------+------------------+------------------+------------------+-----+-------------------+
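Converting such a wide CSV into one-row-per-reading form can be done with `pandas.melt`. This is a minimal sketch; the small inline DataFrame stands in for `pd.read_csv("...", parse_dates=["timestamp"])` on one of the actual files, whose names I don't know:

```python
import pandas as pd

# Stand-in for reading one of the wide CSV files.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-12-25 05:35:20", "2019-12-25 05:35:21"]),
    "name_of_sensor_0": [10, 20],
    "name_of_sensor_1": [11, 21],
})

# Melt the wide sensor columns into long (timestamp, sensor, value) rows,
# i.e. one row per reading, which matches a readings-table layout.
long_df = df.melt(id_vars="timestamp", var_name="sensor", value_name="value")
```

The `sensor` column then holds the column names from the CSV header, which can later be mapped to sensor ids.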
For each sensor there is some descriptive metadata that should also be available in the database, since it might contain important information needed to gain insights during further data analysis. Each sensor has the following metadata:
- designator
- acquisition_channel
- system_code
- unit_code
- sub_system_code
- function_code
- major_counting_number
- minor_counting_number
- measurement_unit
- description
In order to combine the readings and the metadata, I thought about using two SQL tables: one containing all the metadata as given above:
SensorTable
- id
- designator
- acquisition_channel
- unit_code
- system_code
- sub_system_code
- function_code
- major_counting_number
- minor_counting_number
- measurement_unit
- description
And another table containing the reading of each sensor at a given timestamp. To be able to combine the data from both tables, I could perform a JOIN on the `sensor_id`, which is a foreign key:
ReadingsTable
- id
- timestamp
- sensor_id
- value
Both table definitions are implemented using `sqlalchemy` like this:

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Sensor(Base):
    __tablename__ = 'sensors'

    id = Column(Integer, primary_key=True, autoincrement=True)
    designator = Column(String, unique=True, nullable=False)
    acquisition_channel = Column(String)
    unit_code = Column(String)
    system_code = Column(String)
    sub_system_code = Column(String)
    function_code = Column(String)
    major_counting_number = Column(String)
    minor_counting_number = Column(String)
    measurement_unit = Column(String, nullable=False)
    description = Column(String)

    readings = relationship('Reading')

class Reading(Base):
    __tablename__ = 'readings'

    id = Column(Integer, primary_key=True, autoincrement=True)
    timestamp = Column(DateTime, nullable=False)
    sensor_id = Column(Integer, ForeignKey('sensors.id'), nullable=False)
    value = Column(Float, nullable=False)
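The JOIN mentioned above could then be expressed through the ORM roughly like this. It is only a sketch: the in-memory SQLite engine and the sample row are illustrative, and the model definitions from above are repeated in trimmed form so the snippet is self-contained:

```python
from datetime import datetime

from sqlalchemy import (Column, DateTime, Float, ForeignKey, Integer,
                        String, create_engine)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

# Trimmed-down copies of the two models, repeated for a runnable snippet.
class Sensor(Base):
    __tablename__ = "sensors"
    id = Column(Integer, primary_key=True, autoincrement=True)
    designator = Column(String, unique=True, nullable=False)
    measurement_unit = Column(String, nullable=False)
    readings = relationship("Reading")

class Reading(Base):
    __tablename__ = "readings"
    id = Column(Integer, primary_key=True, autoincrement=True)
    timestamp = Column(DateTime, nullable=False)
    sensor_id = Column(Integer, ForeignKey("sensors.id"), nullable=False)
    value = Column(Float, nullable=False)

# In-memory SQLite engine purely for illustration.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

sensor = Sensor(designator="name_of_sensor_0", measurement_unit="bar")
session.add(sensor)
session.flush()  # assigns sensor.id
session.add(Reading(timestamp=datetime(2019, 12, 25, 5, 35, 22),
                    sensor_id=sensor.id, value=30.0))
session.commit()

# JOIN readings to their sensor metadata via the foreign key.
rows = (
    session.query(Reading.timestamp, Reading.value, Sensor.designator)
    .join(Sensor, Reading.sensor_id == Sensor.id)
    .all()
)
```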
This table design looked pretty obvious to me and should fulfill the very basic principles of normalization. However, after having a look at the resulting table rows, I am wondering if I need to (or can) normalize the `timestamp` column any further. Every row in the ReadingsTable contains the reading of one sensor at a given timestamp. Since all sensors measure at the exact same time, I get a lot of duplicate timestamps in that column. Recalling my data mockup from above, an excerpt of the ReadingsTable would look as follows:
+-----+---------------------+-----------+-------+
| id  | timestamp           | sensor_id | value |
+-----+---------------------+-----------+-------+
| 60  | 2019-12-25 05:35:22 | 1         | 30    |
+-----+---------------------+-----------+-------+
| 61  | 2019-12-25 05:35:22 | 2         | 31    |
+-----+---------------------+-----------+-------+
| 62  | 2019-12-25 05:35:22 | 3         | 32    |
+-----+---------------------+-----------+-------+
| ... | ...                 | ...       | ...   |
+-----+---------------------+-----------+-------+
Do I need to normalize the `timestamp` column any further due to the duplicate entries for each timestamp? How could I do this? Which adaptations should I make to my database / table design?
I had a look at this answer which suggests a very similar approach, but still does not address the duplicates in the timestamp column.
Best Answer
Do NOT normalize any "continuous" or "numeric" values, such as a timestamp.
First of all, `TIMESTAMP` and `DATETIME` each take 5 bytes; `INT` takes 4 bytes. So, that's not much savings. A 3-byte `MEDIUMINT UNSIGNED` (0..16M) is still not enough savings to make up for the following...

More importantly, if doing a range scan, having the "range" in a "dimension" table will make the query very inefficient.
`Readings.id` may not be necessary -- the `sensor_id` plus the `timestamp` is unique; make the combination the `PRIMARY KEY` for `Reading`. Meanwhile, `sensor_id` should be as small as practical. `TINYINT UNSIGNED` allows 256 sensors; is that enough? (Note: it will be repeated in `readings` a huge number of times.)

PS to Akina: When fractional seconds were added, `TIMESTAMP` went from 4 bytes to 5 or more; the "more" is if you have fractions. `DATETIME` went from a packed-decimal representation of 8 bytes to only 5 (or more).

Indexes
`Sensors` will have its `id` and any other indexes you will regularly use. Since it is a tiny table, the choices are not important.

The indexes for `Readings` are very important because of the size of the table and the likelihood of big queries. You can't really decide until you have the `SELECTs` sketched out. But I can guess: the PK allows the efficient gathering of info on a single sensor, and it is adequately efficient for the 20 inserts per second that you appear to need.

I don't know if the secondary index will be of any use.
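In the question's `sqlalchemy` models, dropping the surrogate `id` and using the composite primary key could look roughly like this. It is only a sketch: `SmallInteger` stands in for MySQL's `TINYINT UNSIGNED` (which SQLAlchemy exposes only through its MySQL dialect), and the in-memory SQLite engine is just for illustration:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Float, SmallInteger, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Reading(Base):
    __tablename__ = "readings"
    # No surrogate id: (sensor_id, timestamp) is unique, so it is the PK.
    sensor_id = Column(SmallInteger, primary_key=True)
    timestamp = Column(DateTime, primary_key=True)
    value = Column(Float, nullable=False)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Reading(sensor_id=1,
                    timestamp=datetime(2019, 12, 25, 5, 35, 22),
                    value=30.0))
session.commit()
```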
PARTITIONing

The only likely use for partitioning is if you intend to delete "old" data. We can discuss `PARTITION BY RANGE` for time series. Otherwise, don't use `PARTITIONs`.
Summary Table(s)

When graphing a month's worth of data to the second, no matter how the data is stored and indexed, it will take a lot of time to fetch a few million rows, only to throw away most of the info when building the graph. We can discuss Summary Tables further.
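The summary-table idea can be sketched as pre-aggregating per sensor and minute, so graphing queries scan the small aggregate table instead of millions of one-second rows. This is only an illustration (table and column names are made up), shown here with Python's built-in `sqlite3`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor_id INTEGER, timestamp TEXT, value REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, "2019-12-25 05:35:20", 10.0),
     (1, "2019-12-25 05:35:21", 20.0)],
)

# Summary table: one row per sensor and minute, refreshed periodically.
con.execute("""
    CREATE TABLE readings_by_minute AS
    SELECT sensor_id,
           substr(timestamp, 1, 16) AS minute,   -- 'YYYY-MM-DD HH:MM'
           AVG(value) AS avg_value,
           MIN(value) AS min_value,
           MAX(value) AS max_value
    FROM readings
    GROUP BY sensor_id, minute
""")
row = con.execute("SELECT avg_value FROM readings_by_minute").fetchone()
```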
References (for partitioning, summary tables, etc): http://mysql.rjweb.org/