Redesign the storage of large amounts of sensor data

database-design · nosql · scalability

I have been tasked to implement/redesign a solution that will store weather data from a sensor array. The array will consist of ~40 towers, each with ~10 sensors, sampling atmospheric conditions at 10-second intervals for an undetermined length of time (years).
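For a rough sense of scale (assuming every sensor emits one reading per 10-second interval, with no gaps):

```python
# Back-of-the-envelope volume estimate for the array described above.
towers, sensors_per_tower = 40, 10
intervals_per_day = 24 * 60 * 60 // 10          # one reading every 10 seconds

readings_per_day = towers * sensors_per_tower * intervals_per_day
readings_per_year = readings_per_day * 365

print(readings_per_day)   # 3,456,000 readings/day
print(readings_per_year)  # ~1.26 billion readings/year
```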
Some of the applications and requirements for this task are as follows:

  • Manage and retrieve tower/sensor configurations to make sense of data analysis.
  • Data visualization by sensor or time intervals for meteorological observations.
  • Provide customers with reliable and persistent data resources/data sets to compare model and sensor performance (may require some post-processing to deliver in the format required?).

Note: The current solution (implemented as a proof of concept, with 5 towers) stores data as flat files (one file per hour).

I was originally unsure whether this would constitute a big data problem in the future, so I researched a couple of solutions for both relational and NoSQL databases, but I feel I need a little more guidance, as I am no expert in data management.

One solution I considered was to store the data in a relational database indexed by tower, sensor, and timestamp, partitioning the table by date.

Another, with future scaling in mind, was to store it in a document-oriented NoSQL database, like MongoDB, and mimic the structure of the current solution.

Are any of these good approaches? If not, what would be a better/recommended solution? Also, would it even be necessary to redesign the current solution? I was told that the rationale for using flat files was that they believed a relational database would incur too much overhead. Is there a way to avoid that overhead, if it is indeed an issue?

Best Answer

Since (a) the information you are working with appears to be, in and of itself, a very valuable organizational resource, and (b) the volume of data will be considerable, I would (c) decidedly build a relational database on one of the major SQL platforms.

That of course —from a very general perspective— requires three essential factors:

  1. A clearly defined conceptual schema, in which one has to identify and mark out with precision the prototypes of things, i.e., the entity types (including their properties and interrelationships) of relevance in the business environment you are working with (e.g., the Towers and Sensors you mention).

    As you know, this point entails establishing continuous and productive communication with business experts.

  2. A logical layout that reflects the conceptual level with accuracy, by means of tables (i.e., mathematical relations) holding well-delimited columns with appropriate column names and types (i.e., relation attributes) and all the corresponding constraints to ensure that the data complies with all the rules determined at the previous tier.

    Therefore, it is here where the vast power of the relational model comes into play (though its advantages have positive repercussions at other levels of abstraction).

  3. A physical arrangement that, e.g., boosts the execution speed of the (dynamic) logical data manipulation operations and enforces the logical constraints.

    Since the relational model offers physical data independence, a database management system (DBMS for brevity) can provide any kind of structure at this level, not exclusively indexes, to support the logical definitions. In the case of the leading SQL platforms, yes, this commonly implies, precisely, setting up an indexing strategy based on the database-specific query tendencies, and you brought up very interesting considerations with respect to some possible configurations but, without knowing the particular informational necessities with exactitude, offering specific advice in this regard would not be suitable.

    Other elements that deserve evaluation are, e.g., upgrading network infrastructure to increase bandwidth, enabling proper server configurations (hardware- and software-wise), etc. And, if, and only if, a practitioner is qualified enough, he or she could even modify the source code of the DBMS of choice (more feasible in open source environments, naturally).
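As an illustration only of points 1 and 2 above, a minimal logical layout for the scenario you describe might look like the following sketch, which uses SQLite (bundled with Python) for convenience; the table and column names (Tower, Sensor, Reading) are assumptions, not prescriptions, and the real schema must come out of the conceptual work with the business experts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Hypothetical tables; every rule is declared as a constraint, not in app code.
conn.executescript("""
CREATE TABLE Tower (
    TowerNumber  INTEGER PRIMARY KEY,
    Latitude     REAL NOT NULL,
    Longitude    REAL NOT NULL
);
CREATE TABLE Sensor (
    TowerNumber  INTEGER NOT NULL REFERENCES Tower,
    SensorNumber INTEGER NOT NULL,
    SensorType   TEXT    NOT NULL,      -- e.g., 'temperature', 'humidity'
    PRIMARY KEY (TowerNumber, SensorNumber)
);
CREATE TABLE Reading (
    TowerNumber  INTEGER NOT NULL,
    SensorNumber INTEGER NOT NULL,
    ReadingAt    TEXT    NOT NULL,      -- UTC timestamp
    Value        REAL    NOT NULL,
    PRIMARY KEY (TowerNumber, SensorNumber, ReadingAt),
    FOREIGN KEY (TowerNumber, SensorNumber) REFERENCES Sensor
);
""")

# Sample rows purely to show the structure in use.
conn.execute("INSERT INTO Tower VALUES (31, 45.5, -73.6)")
conn.execute("INSERT INTO Sensor VALUES (31, 1750, 'temperature')")
conn.execute("INSERT INTO Reading VALUES (31, 1750, '2017-06-01T00:00:10Z', 18.5)")
n_readings = conn.execute("SELECT COUNT(*) FROM Reading").fetchone()[0]
```

On a production platform (e.g., PostgreSQL or SQL Server) the Reading table is also where a date-based partitioning scheme, as mentioned in the question, would typically be applied at the physical level.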

In this way, the following aspects that you highlight

  • Manage and retrieve tower/sensor configurations to make sense of data analysis.
  • Data visualization by sensor or time intervals for meteorological observations.
  • Provide customers with reliable and persistent data resources/data sets to compare model and sensor performance (may require some post-processing to deliver in the format required?).

would be well-addressed, because you would easily be able to declare queries to, e.g., obtain information in very meaningful forms. For instance, you can get data associated with

  • the Sensor identified by SensorNumber 1750, installed at the Tower identified by TowerNumber 31, between the Date 1 June 2017 and the Date 27 June 2017.
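The retrieval just described collapses into a single declarative query. A hedged sketch, again in SQLite with made-up sample rows and assumed column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Reading (
    TowerNumber INTEGER, SensorNumber INTEGER, ReadingAt TEXT, Value REAL)""")
conn.executemany(
    "INSERT INTO Reading VALUES (?, ?, ?, ?)",
    [(31, 1750, "2017-06-15T08:00:00Z", 21.4),
     (31, 1750, "2017-07-01T08:00:00Z", 23.0),   # outside the requested range
     (40, 1750, "2017-06-15T08:00:00Z", 19.9)])  # different tower

rows = conn.execute("""
    SELECT ReadingAt, Value
    FROM   Reading
    WHERE  TowerNumber  = 31
    AND    SensorNumber = 1750
    AND    ReadingAt BETWEEN '2017-06-01' AND '2017-06-27'
    ORDER  BY ReadingAt
""").fetchall()
print(rows)  # only the 15 June reading from tower 31 qualifies
```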

Furthermore, since (1) the data in a relational database is logically managed in terms of sets with the aid of operations based on the relational algebra, and (2) the different SQL engines are physically optimized (some more than the others) for set processing, you can, e.g.,

  • compare set a with set b;
  • join set c with set d;
  • obtain subset f through a restriction on set e;
  • produce n subsets from n set intersections;
  • project n attributes from set f;
  • retrieve information from set z that is the result of a union of set x with set y;
  • and so on.
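A toy sketch of a few of those set operators, using SQLite through Python (table names A and B are arbitrary):

```python
import sqlite3

# Two toy sets to exercise SQL's union, intersection and difference operators.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (n INTEGER);
CREATE TABLE B (n INTEGER);
INSERT INTO A VALUES (1), (2), (3);
INSERT INTO B VALUES (2), (3), (4);
""")

union     = conn.execute("SELECT n FROM A UNION     SELECT n FROM B ORDER BY n").fetchall()
intersect = conn.execute("SELECT n FROM A INTERSECT SELECT n FROM B ORDER BY n").fetchall()
diff      = conn.execute("SELECT n FROM A EXCEPT    SELECT n FROM B ORDER BY n").fetchall()

print(union)      # [(1,), (2,), (3,), (4,)]
print(intersect)  # [(2,), (3,)]
print(diff)       # [(1,)]
```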

The data manipulation possibilities are in fact huge —demonstrating the unmatched versatility of the relational paradigm— since you can work not only with base tables (the ones declared with CREATE TABLE … ( … ); statements) but also with derived ones (the ones expressed via SELECT …; operations, sometimes fixed as VIEWs). In other words, you can (i) express new data structures based on (ii) prior ones operating on (iii) the single underlying relational construct, i.e., the mathematical relation.
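As a small illustration of a derived table fixed as a VIEW, here is an assumed hourly-average aggregation of sensor readings (SQLite syntax; all names and sample values are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Reading (
    TowerNumber INTEGER, SensorNumber INTEGER, ReadingAt TEXT, Value REAL);
INSERT INTO Reading VALUES
    (31, 1750, '2017-06-15T08:00:10Z', 20.0),
    (31, 1750, '2017-06-15T08:00:20Z', 22.0),
    (31, 1750, '2017-06-15T09:00:10Z', 25.0);

-- A derived table fixed as a view: hourly averages per sensor.
CREATE VIEW HourlyAverage AS
SELECT TowerNumber, SensorNumber,
       substr(ReadingAt, 1, 13) AS Hour,   -- 'YYYY-MM-DDTHH'
       AVG(Value)               AS AvgValue
FROM   Reading
GROUP  BY TowerNumber, SensorNumber, Hour;
""")

rows = conn.execute("SELECT * FROM HourlyAverage ORDER BY Hour").fetchall()
print(rows)
# [(31, 1750, '2017-06-15T08', 21.0), (31, 1750, '2017-06-15T09', 25.0)]
```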

Evidently, the arrangement of the base tables and columns of a relational database can evolve, and (a) new base tables or columns may be incorporated into it when (b) keeping track of new entity types or entity type properties is deemed worthwhile in the pertinent business context. In other words, neither the initial structure nor the opening constraints of a relational database are expected to be static or immutable. Besides, a database that is appropriately organized from the very beginning tends to be much easier to modify when new informational requirements arise.

In agreement with the considerations above, the logical format of the applicable sets must be produced declaratively, at the database logical level. The graphical or presentational formatting of the sets (e.g., the colouring or font faces utilized) must in turn be processed by way of the code of one or more application programs (yes, mostly in a procedural manner, perhaps with the assistance of an object-oriented framework, HTML, etc.), in order to make the access and presentation of such sets user-friendly. Certainly, you could as well use reporting software that connects with your database.

The modeling of a database of relevance

Given that you will be working with sensor data (which, among other features, typically involves information in the shape of time series), you may find helpful the multiple database design and overall administration principles contained in two exceptional answers by @PerformanceDBA to related questions on this site.

The Relational, Flat File and NoSQL approaches

The relational model by Dr. Edgar Frank Codd, although published in 1970, remains genuinely the most modern and elegant method (based on logic and set theory) to cope with the problem of data management. The distinct SQL DBMSs are, in turn, the most popular approximations (that, although not fully compliant, are nevertheless really powerful) to the systems proposed in the relational theory, and some of them have been heavily optimized (e.g., regarding their physical-level mechanisms) for even decades now. In addition, the main SQL platforms of course can (and will be able to) work with the most up-to-date storage (e.g., hard drives) and processing (e.g., CPUs) technologies quite efficiently.

When built on a powerful DBMS, a relational database that is properly designed at the conceptual, logical and physical levels would decidedly become a self-contained, self-descriptive and self-protective asset that (1) is trustworthy and (2) offers a fast response, two aspects that, as you know, are of great significance.

Flat files

Therefore, the claim that follows

I was told that the rationale for using flat files is that they believed a relational database would take too much overhead.

is easily discarded, because the flat file approach would be:

  • pre-scientific;
  • far from optimal for large volumes of data;
  • too cumbersome;
  • application-program-dependent (you would have to implement by hand most of the features that a proper DBMS offers natively);
  • easily undermined in terms of performance;
  • etc.

The much more convenient relational approach, by contrast, would at the very least:

  • offer great scalability (it is independent of the physical level, so you could enhance the underlying physical mechanisms as needed);
  • bring a simple style of data manipulation (via abstract operations); and
  • work with multiple application programs simultaneously (e.g., one or more mobile apps, and/or one or more web apps, and/or one or more desktop apps, etc.).

If, however, you opt for utilizing flat files, you should evaluate the employment of a robust utility like Awk which, although not a DBMS (and not designed as such), supplies handy resources to deal with files, records, fields, etc. See The GNU Awk User’s Guide for more information on this subject.

NoSQL

“Unstructured data” and associated terms

As per their propaganda, the initial justification for the usage of NoSQL DBMSs is that they are meant to be used in business domains that involve handling “unstructured data”, which calls for the question:

  • What is the expression “unstructured data” supposed to mean?

In that respect, it must be said that data, by its very nature, is structured; if it had no structure then it would be something meaningless, consequently such a thing (i) could not be considered data and (ii) would not be worth administering. Hence, “unstructured data” is a contradictory and unfortunate expression.

Another phrase from those contexts is “semi-structured data”. That phrase suggests that there exists data that is structured “in part” or “in half”; in accordance with the previous paragraph, only the “part” or “half” that is structured can be actual data, while the remaining “part” or “half” is merely a shapeless thing that, lacking structure, cannot be referred to as data.

Yet another, alas, typical term found in NoSQL marketing is “polymorphic data”. If said term signifies something like “data that has many different forms”, then it denotes ordinary data, which has always come about in many distinct forms. And since it has many different forms, it presents many distinct structures, so there is nothing special about this “kind” of data.

Needless to say, data and data structures have always been susceptible to change, so there is nothing unusual in this regard either.

Data volume growth

Evidently, the volumes of information managed by means of computerized systems have definitely grown over the years —and will keep growing exponentially as time passes, because new systems are being built every day—, but that is a fact that does not have to do with the structure of the information per se.

Lack of a rounded theoretical foundation

A critical limitation of NoSQL systems (of which there are different classes, e.g., document- and graph-based) is that none of the current products, although heavily marketed and labeled as “modern”, possesses a sound theoretical basis (if any at all) that supports each and every one of the three most important elements of a proper DBMS, i.e., tools for data (a) definition, (b) manipulation, and (c) constraining. Thus, the NoSQL approach actually suggests a regression to an ancient era in which the handling of data was performed in an ad hoc and unsound manner, with all the needless complexity that entails.

Nowadays, graph systems are included within the “NoSQL” spectrum. These software products invite one to manage data by virtue of operations on two distinct structures, nodes and relationships (which, once again, is in conflict with the term “unstructured data”), and they stand out in the “NoSQL” group because they do have a mathematical foundation. However, graph products are rather similar to network platforms, which have been deemed obsolete for decades (an obvious drawback is that, as suggested above, they need two structures for data representation, while a relational DBMS, as per the information principle, needs only one).

Even if the creation of the different NoSQL systems is chronologically newer in comparison with the origins of the majority of the SQL DBMSs, most of the concepts onto which the NoSQL products are based are, in effect, primitive.

A NoSQL program should be employed, mostly, in scenarios where, e.g.,

  • the IT personnel lacks the technical skills required to determine (or to determine in a timely manner) the structure of the data of interest, e.g., due to its complexity; and/or
  • the organization cannot afford suitable education and training for the present staff, or cannot hire new staff that possesses the required education and training; and/or
  • the integrity and consistency of the data is not particularly important; and/or
  • blending the data concerned with that of mission-critical systems that demand high precision is not expected.

But, even if the structure of the data at issue is not defined before the creation of the systems concerned, one or more people will necessarily have to

  • discover the aforesaid structure,
  • discard all the surrounding “interference” and
  • collect and link the proper data

after the database and app(s) have entered the production stage, in order to get the most out of all the resources invested in the project. Data structure delineation is therefore a task that cannot be bypassed; it has to be done sooner or later.

So, while going the NoSQL way is a possibility, all the factors previously mentioned should definitely be taken into account.

The most robust method

In contrast, approaching the informational requirements of a business environment in a relational manner —i.e., with a general paradigm behind— offers the possibilities of (1) managing the data in its natural structure from the beginning —which eases integration with other data sources— and also of (2) producing new trustworthy structures by dint of the manipulation of a single instrument —as explained in previous sections— that has a robust scientific basis.

According to your description of the scenario in question, you have already identified a particular structure in terms of the relevant organizational needs, so I suggest requesting that the business domain experts validate it. Subsequently, I recommend taking advantage of (i) the constructs (the relation, the constraints and the operations) provided by the relational model to handle said structure and the respective data, and of (ii) your SQL DBMS of choice, which will very likely offer very efficient physical tools that can support present demands and supply future scalability.