MySQL – Database redesign opportunity: What table design to use for this sensor data collection

Tags: database-design, MySQL, partitioning

Background

I have a network of approximately 2000 sensors, each of which has about 100 data points that we collect at 10-minute intervals. These data points are typically int values, but some are strings and floats. This data should be stored for 90 days, and longer if that remains efficient.

Database Design

When originally tasked with this project, I wrote a C# app that wrote comma-separated files for each sensor. At the time there were not as many sensors; when someone wanted to look at trends, we would open the CSV in Excel and graph it as needed.

Things grew and we switched to a MySQL database. I created a table for each sensor (yes, I know, lots of tables!); it has been working well, but it has some limitations. With so many tables, it is effectively impossible to write a query that searches across all the sensors for a particular value.

For the next version, I switched to Microsoft SQL Server Express, and put all sensor data into one large table. This also works, and lets us do queries to find values among all sensors that are of interest. However, I ran into the 10GB limit for the Express version, and have decided to switch back to MySQL rather than invest in SQL Server Standard.

The Question

I am happy with MySQL's performance and scalability, but am uncertain whether sticking with the all-data-in-one-table approach is best. 10GB in a single table seems to be asking for a different design. I should mention that the need to query data for graphing is still there, and I'm concerned there will be performance issues for a query that graphs, for example, temperature data for one sensor over the full 90 days. (In other words, the graph should be quick to produce, without waiting for SQL to sort through piles of data just to isolate the sensor of interest.)

Should I split this table up in some way to increase performance? Or is it not unusual to have such a large table?

I have indexes on the Sensor ID and Timestamp columns, which pretty much define the boundaries of any query (i.e., get data for sensor X from time A to time B).
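
If those are two separate single-column indexes, note that MySQL typically picks just one index per table access (index-merge plans aside), so a single composite index covering both columns usually serves this exact pattern better. A minimal sketch, assuming the table and column names that appear in the query further below (data, sensor_id, data_timestamp):

ALTER TABLE data
  ADD INDEX idx_sensor_time (sensor_id, data_timestamp);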

I've read a little bit about sharding and partitioning, but don't feel those are appropriate in this case.


Edit:

Based on comments and answers so far, some additional info may be helpful:

Not Indefinite Storage: Currently I do not store data past 90 days. Daily, I run a query that removes data older than 90 days. If it becomes important in the future, I will store more, but for now it is sufficient. This helps keep the size in check and performance high(er).
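
For reference, a minimal sketch of such a daily purge, using the table and column names from the query shown later (data, data_timestamp):

DELETE FROM data
WHERE data_timestamp < NOW() - INTERVAL 90 DAY;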

Engine Type: The original MySQL implementation used MyISAM. When creating the tables this time for the new implementation (one data table instead of many) they've defaulted to InnoDB. I don't believe I have a requirement for one or the other.

Normalization: There are of course other tables besides the data collection table. These support tables store things such as network information for the sensors, login information for users, etc. There isn't much to normalize (as far as I know). The reason the data table has so many columns is that there are that many variables from each sensor. (Multiple temperatures, light levels, air pressure, etc.) Normalization to me means that there is no redundant data or repeating groups. (At least for 1NF.) For a given sensor, storing all values at a particular time requires one row of data and there are no 1:N relationships involved there (that I see).

I could break the table apart functionally, putting (for example) all temperature-related values in one table and all air-pressure-related values in another. While this might improve efficiency for someone making a temperature-only query, I would still have to insert all of the data at once. Still, the efficiency gain might be worthwhile for SELECT operations. Ideally, I would split the table vertically based on how often users request each group of data. Perhaps this is all I should do; in asking my question, I am partly looking for confirmation that doing so would be worthwhile.
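
A hedged sketch of what such a vertical split could look like; the table names and the exact columns per table are illustrative assumptions, with both tables sharing the same (sensor_id, data_timestamp) key:

CREATE TABLE data_temperature (
  sensor_id      INT NOT NULL,
  data_timestamp DATETIME NOT NULL,
  temp1          FLOAT,          -- one column per temperature reading
  PRIMARY KEY (sensor_id, data_timestamp)
);

CREATE TABLE data_air_pressure (
  sensor_id      INT NOT NULL,
  data_timestamp DATETIME NOT NULL,
  air1           FLOAT,
  PRIMARY KEY (sensor_id, data_timestamp)
);

-- Each collection cycle inserts one row into each table under the same
-- key, so a temperature-only SELECT touches only the narrower table.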


Edit 2:

Data Usage: Ultimately much of the data is never looked at or needed, because we typically focus only on items with problems. But in attempting to find problems we use various tools to search the data and determine what items to zoom in on.

For example, we noticed a correlation between a memory usage value (a customer-specific proprietary software program) and a reboot/crash. One of the data points I collect relates to this memory usage, and I was able to look at historical data to show that devices become unstable after a particular memory usage is exceeded. Today, for the subset of devices running this software, I check this value and issue a reboot command if it is too high. Until this was discovered, I did not think collecting this data was of value.

For this reason, I've maintained that all of the roughly 100 data points be collected and stored, even if their value is questionable. But in normal day-to-day usage, users typically examine perhaps a dozen of these parameters. If a user becomes interested in a particular geographic area, he may (using software) generate graphs or spreadsheets of data for perhaps a few dozen sensors. It's not uncommon to look at a 30-day graph with two or three plot lines showing such things as temperature, air pressure, and light levels. Doing this would run a query similar to this:

SELECT sensor_id, location, data_timestamp, temp1, air1, light1
FROM data
WHERE data_timestamp >= '2012-02-01'
AND sensor_id IN (1, 2, 3);

(In the original MySQL version, where each sensor had its own table, three separate queries would be issued, but the results combined in software to create the graph.)

Because the data table contains so many rows (~10 million), the performance is notably worse than in the multiple-table scenario, despite the indexes on sensor_id and data_timestamp (4500 rows returned in 9 seconds for this example, versus less than one second before). On the other hand, the ability to find which sensors meet certain criteria is practically zero in the multiple-table schema, which is the reason for moving to a single table.
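
One way to see whether MySQL is actually using those indexes for this query pattern is EXPLAIN; a quick sketch against the query above:

EXPLAIN SELECT sensor_id, location, data_timestamp, temp1, air1, light1
FROM data
WHERE data_timestamp >= '2012-02-01'
AND sensor_id IN (1, 2, 3);

-- If the key column in the output is NULL, or rows is in the millions,
-- the single-column indexes are not serving this query well; a composite
-- (sensor_id, data_timestamp) index is a plausible fix.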

This type of query can be done by multiple users in quick succession as they select different groups of data and compare the graphs from each result. It can be quite frustrating to wait nearly 10 seconds per graph or spreadsheet.

Data is discarded after 90 days. It could be archived, but that is not currently a requirement.

Hopefully this information helps more adequately show how the data is used after collection and storage.

Best Answer

There is one big reason you should think about partitioning the table.

Every index you have on a giant table, even just one, can generate a lot of CPU load and disk I/O purely for index maintenance when executing INSERTs, UPDATEs, and DELETEs.

I wrote an earlier post back on October 7, 2011 on why Table Partitioning would be a big help. Here is one excerpt from my past post:

Partitioning of data should serve to group data that belong logically and cohesively to the same class. Performance of searching each partition need not be the main consideration, as long as the data is correctly grouped. Once you have achieved the logical partitioning, then concentrate on search time. If you are just separating data by id only, it is possible that many rows of data may never be accessed for reads or writes. Now, that should be a major consideration: locate all ids most frequently accessed and partition by that. All less frequently accessed ids should reside in one big archive table that is still accessible by index lookup for that 'once in a blue moon' query.

You can read my entire post on this later.
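
The excerpt above discusses partitioning by access pattern; purely as an illustration (a sketch, not necessarily the scheme meant above), time-range partitioning maps naturally onto the 90-day retention described in the question. Partition names and boundaries here are hypothetical, and note that MySQL requires the partitioning column to be part of every unique key on the table:

ALTER TABLE data
PARTITION BY RANGE (TO_DAYS(data_timestamp)) (
  PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
  PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
  PARTITION p2012_03 VALUES LESS THAN (TO_DAYS('2012-04-01')),
  PARTITION p_max    VALUES LESS THAN MAXVALUE
);

-- Expiring a month of old data then becomes a near-instant metadata
-- operation instead of a large DELETE:
ALTER TABLE data DROP PARTITION p2012_01;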

To cut right to the chase, you need to research and find out what data is rarely used in your 10GB table. That data should be placed in an archive table that is readily accessible should you need ad hoc queries of a historical nature. Migrating that archival data out of the 10GB table, followed by an OPTIMIZE TABLE on it, can result in a working set that is faster for SELECTs, INSERTs, UPDATEs, and DELETEs. Even DDL would go faster on a 2GB working set than on a 10GB table.
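
A minimal sketch of that migration, assuming a hypothetical data_archive table created with the same definition as data:

INSERT INTO data_archive
SELECT * FROM data
WHERE data_timestamp < NOW() - INTERVAL 90 DAY;

DELETE FROM data
WHERE data_timestamp < NOW() - INTERVAL 90 DAY;

OPTIMIZE TABLE data;  -- reclaim space and defragment the working set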

UPDATE 2012-02-24 16:19 EDT

Two points to consider

  1. From your comment, it sounds like normalization may be what you need.
  2. You may need to migrate everything over 90 days old into an archive table while still being able to query the archive and the working set together. If your data is all MyISAM, I recommend the MERGE storage engine: create the MERGE table map once, uniting a working-set MyISAM table and an archive MyISAM table. Keep data less than 91 days old in the working-set table, roll anything older into the archive, and query only the MERGE table map (see the sketch after this list).
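
A minimal sketch of that setup, with hypothetical table names; the MERGE engine requires every underlying table to be MyISAM with an identical definition:

CREATE TABLE data_working (
  sensor_id      INT NOT NULL,
  data_timestamp DATETIME NOT NULL,
  temp1          FLOAT,           -- remaining sensor columns omitted
  KEY idx_sensor_time (sensor_id, data_timestamp)
) ENGINE=MyISAM;

CREATE TABLE data_archive LIKE data_working;   -- identical MyISAM definition

CREATE TABLE data_all (
  sensor_id      INT NOT NULL,
  data_timestamp DATETIME NOT NULL,
  temp1          FLOAT,
  KEY idx_sensor_time (sensor_id, data_timestamp)
) ENGINE=MERGE UNION=(data_working, data_archive) INSERT_METHOD=FIRST;

-- Queries run against data_all; INSERT_METHOD=FIRST sends new rows to
-- data_working, and a nightly job rolls >90-day rows into data_archive.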

I have made two earlier posts on how to use it.

Here is an additional post I made on tables with a lot of columns: Too many columns in MySQL