MySQL – Approaching database design

database-design, mariadb, MySQL

I have 1,000 sensors, and I need to store the value of each sensor every second, for a month. This is still theoretical; the realistically achievable sampling interval will be determined when I start testing, but it will be at most 5 seconds.

If I store a single row per sensor per second, I can get away with just 5 columns: entry_id, dim_id, dim_second, dim_date, sensorValue. However, that works out to (86,400 * 1000) * 31, or 2,678,400,000 rows per month. That is a whole lot of rows.
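Roughly, that narrow layout would look something like this in MySQL/MariaDB (the column names mirror the ones above; the types and the index are just a guess on my part):

create table sensor_reading (
  entry_id    bigint unsigned not null auto_increment primary key,
  dim_id      int not null,                  -- which sensor
  dim_date    date not null,                 -- day of the reading
  dim_second  mediumint unsigned not null,   -- second of the day, 0..86399
  sensorValue double not null,
  key idx_sensor_day (dim_id, dim_date)      -- assumed access path: one sensor, one day
);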

If I instead had 1,004 columns (one per sensor plus the key columns), I could get away with 86,400 * 31, or 2,678,400 rows. That is a ton of columns.
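The wide layout would be something like the sketch below (column names made up, and truncated rather than writing out all 1,000 sensor columns); if I'm not mistaken, it would also be getting close to InnoDB's per-table column limit.

create table sensor_reading_wide (
  entry_id    bigint unsigned not null auto_increment primary key,
  dim_date    date not null,
  dim_second  mediumint unsigned not null,
  sensor_0001 double,
  sensor_0002 double,
  -- ...roughly a thousand more sensor columns...
  sensor_1000 double
);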

Most of what I've read online tells me it is almost always undesirable to use columns this way in a relational database. But if I do go with one entry per sensor per second, do I just dump everything into a single fact table? Of course I wouldn't make 1,000 tables, but should sensors be grouped into separate fact tables, maybe by unit (flow, amperage, etc.), to keep the data set smaller for queries?

Or maybe a relational database is not the right choice for this application and I should consider NoSQL? I have never worked with it.

I'm using a machine with an Intel Core i7-4650U, 16 GB of RAM, and a 1 TB SSD for the development environment.

EDIT:
I should note that my data pipeline summarizes this data and dumps it into other tables. The issue is that this has to be an on-prem solution, and I want to be able to bring back big batches of raw data for analysis, hence the long period and small grain.
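Roughly, the kind of roll-up I mean is an aggregation like this sketch (the summary table and its columns are placeholders, assuming the narrow per-reading layout above and an example date):

-- Placeholder roll-up: per-sensor, per-minute averages for one example day.
-- Assumes a summary table sensor_reading_minute(dim_id, dim_date, dim_minute,
-- avg_value, min_value, max_value) already exists.
insert into sensor_reading_minute (dim_id, dim_date, dim_minute, avg_value, min_value, max_value)
select dim_id,
       dim_date,
       dim_second div 60 as dim_minute,
       avg(sensorValue),
       min(sensorValue),
       max(sensorValue)
from sensor_reading
where dim_date = '2024-01-15'
group by dim_id, dim_date, dim_second div 60;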

Best Answer

I don't know about MySQL/MariaDB, but PostgreSQL lets you store arrays. An array of 86,400 sensorValues would allow you to have one record per entry_id, dim_id, dim_date.

-- One row per sensor per day; the 86,400 per-second values live in the array.
create table sensorvalues (
  entry_id    serial primary key,
  dim_id      int,           -- sensor id
  dim_date    date,          -- day the readings belong to
  sensorvalue int[86400]     -- one element per second of the day; PostgreSQL
                             -- does not enforce the declared length
);
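As a usage sketch (the sensor id and date are made up), a single second's reading can be pulled out by index, keeping in mind that PostgreSQL arrays are 1-based, or a whole day can be unnested back into one row per second:

-- Value recorded at 01:00:00 (second 3600 of the day) for sensor 42:
select sensorvalue[3601]
from sensorvalues
where dim_id = 42
  and dim_date = date '2024-01-15';

-- Unnest a day back into one row per second for analysis:
select dim_id,
       dim_date,
       s.ordinality - 1 as dim_second,
       s.value
from sensorvalues,
     unnest(sensorvalue) with ordinality as s(value, ordinality)
where dim_date = date '2024-01-15';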