There is one big reason you should think about partitioning the table: every index you have on a giant table, even just one index, can generate a lot of CPU load and disk I/O simply to perform index maintenance when executing INSERTs, UPDATEs, and DELETEs.
I wrote an earlier post back on October 7, 2011 on why Table Partitioning would be a big help. Here is one excerpt from my past post:
Partitioning of data should serve to group data that are logically and
cohesively in the same class. Performance of searching each partition
need not be the main consideration as long as the data is correctly
grouped. Once you have achieved the logical partitioning, then
concentrate on search time. If you are just separating data by id
only, it is possible that many rows of data may never be accessed for
reads or writes. Now, that should be a major consideration: Locate all
ids most frequently accessed and partition by that. All less
frequently accessed ids should reside in one big archive table that is
still accessible by index lookup for that 'once in a blue moon' query.
You can read my entire post on this later.
To cut right to the chase, you need to research and find out what data is rarely used in your 10GB table. That data should be placed in an archive table that is still readily accessible should you need ad hoc queries of a historical nature. Migrating that archival data out of the 10GB table, followed by an OPTIMIZE TABLE on what remains, can produce a Working Set that is faster for SELECTs, INSERTs, UPDATEs, and DELETEs. Even DDL would go faster on a 2GB Working Set than on a 10GB table.
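For instance, a rough sketch of that migration in MySQL might look like the following, assuming a hypothetical 10GB table named working_set, a structurally identical working_set_archive table, and a last_access column that marks rarely used rows (all three names are made up for illustration):

    -- Copy rarely used rows (here: not touched in over a year) into the archive
    INSERT INTO working_set_archive
    SELECT * FROM working_set
    WHERE last_access < NOW() - INTERVAL 1 YEAR;

    -- Remove those rows from the working set
    DELETE FROM working_set
    WHERE last_access < NOW() - INTERVAL 1 YEAR;

    -- Reclaim space and defragment the table and its indexes after the purge
    OPTIMIZE TABLE working_set;

In practice you would batch the INSERT/DELETE in smaller chunks to avoid long locks, but the overall flow is the same.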
UPDATE 2012-02-24 16:19 EDT
Two points to consider
- From your comment, it sounds like normalization is what you may need.
- You may need to migrate everything over 90 days old into an archive table but still access the archive and the working set at the same time. If your data is all MyISAM, I recommend using the MERGE storage engine. First, you create the MERGE table map once, uniting a working-set MyISAM table and an archive MyISAM table. You would keep data less than 91 days old in one MyISAM table and roll over any data more than 90 days old into the archive. You would query only the MERGE table map.
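A minimal sketch of that setup, using made-up table names (readings_current for the working set, readings_archive for the rollover target, readings_all for the MERGE table map):

    -- Working-set table: rows 90 days old or newer (MyISAM, as required by MERGE)
    CREATE TABLE readings_current (
        id INT NOT NULL,
        reading_time DATETIME NOT NULL,
        value DOUBLE,
        KEY idx_reading_time (reading_time)
    ) ENGINE=MyISAM;

    -- Archive table: identical structure, holds rows over 90 days old
    CREATE TABLE readings_archive LIKE readings_current;

    -- The MERGE table map that unites both; all queries go against this one
    CREATE TABLE readings_all (
        id INT NOT NULL,
        reading_time DATETIME NOT NULL,
        value DOUBLE,
        KEY idx_reading_time (reading_time)
    ) ENGINE=MERGE UNION=(readings_current, readings_archive) INSERT_METHOD=FIRST;

The nightly rollover job then simply moves rows older than 90 days from readings_current into readings_archive.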
Here are two posts I made on how to use the MERGE storage engine:
Here is an additional post I made on tables with a lot of columns
Too many columns in MySQL
Yes, there is a pretty big pitfall you're going to run into fairly quickly, and that is the size and maintenance of the tables. You are somewhat on the right track by saying that you want to put your data into a temporary table daily and then move it into your permanent table, but you'll soon run into trouble with this scheme.
For example, let's say you want to "roll off" the oldest month's worth of data after two years. In your design, you would have to issue a DELETE statement against your big, big table. This will likely be somewhat slow, depending on the number of indexes you have. Also, it will cause index fragmentation, and the only way to fix that would be to rebuild or reorganize the indexes on this very large table which would also cause performance problems. There are a whole host of other issues with a big single table type design as well. For example, with a big, single table, you can't do FILEGROUP based backups, which means that if you want to have a full backup of your database, it's gonna be BIG, and it's gonna take a LONG time to complete.
What's the solution? Table partitioning. Read about this in depth, in as many places as you can. Basically, partitioning allows you to split up your data into "tables within tables" -- each partition shares the same schema and is accessed through the table object, but can be indexed and maintained differently. Partitions are basically tables, cut up by some useful key; in your case it will likely be date. They can be dropped just like (and just as fast as) tables, which means that if you partition your big data tables by date, you can simply drop old partitions instantly, with no adverse effect on the indexes of any of the other partitions. You can put partitions on different filegroups, which means that older partitions can be rolled off, or rolled onto cheaper commodity storage if they're not commonly used. Last but not least, in SQL 2012 you'll be able to create COLUMNSTORE indexes on your older, read-only partitions, while having a different, more insert-oriented indexing scheme on the active partition where you're inserting all your sensor data.
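To make that concrete, here is a rough T-SQL sketch; the names (pfMonthly, psMonthly, dbo.SensorData) and the monthly boundary values are invented, and the real partition function, filegroups, and keys have to come out of your own research:

    -- Partition function and scheme: one partition per month (boundaries are examples only)
    CREATE PARTITION FUNCTION pfMonthly (datetime2)
        AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

    CREATE PARTITION SCHEME psMonthly
        AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- The big sensor table, partitioned by the reading timestamp
    CREATE TABLE dbo.SensorData
    (
        SensorId    int            NOT NULL,
        ReadingTime datetime2      NOT NULL,
        Payload     varbinary(max) NULL,
        CONSTRAINT PK_SensorData PRIMARY KEY CLUSTERED (ReadingTime, SensorId)
    ) ON psMonthly (ReadingTime);

    -- "Rolling off" the oldest month is then a metadata operation: switch the partition
    -- out to a staging table (assumed to exist with an identical structure) and drop it,
    -- instead of issuing a massive DELETE.
    ALTER TABLE dbo.SensorData SWITCH PARTITION 1 TO dbo.SensorData_Staging;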
Hope this helps. You have a good amount of research to do regarding partitioning and partitioning schemes, but hopefully now you know the direction you need to be looking.
P.S.:
Oh, and I forgot your bulleted list of questions...
Answers 1, 2, and 5: See above. Answer 3: In SQL Server, you can compress data on a partition-by-partition basis, so compress your older partitions aggressively using PAGE compression. However, I believe your out-of-row large data types will not be compressed if you do this -- again, you may want to alleviate this problem by normalizing your sensor values. Answer 4: Absolutely not, but if all you want to do is store static data by day and never search on it any other way, compressed flat files may be a much easier way to go.
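As a small illustration of the per-partition compression mentioned in Answer 3, again using the hypothetical dbo.SensorData table from the sketch above (partition 1 standing in for an older, mostly read-only partition):

    -- Rebuild only the oldest partition with PAGE compression; the active partition is untouched
    ALTER TABLE dbo.SensorData
        REBUILD PARTITION = 1
        WITH (DATA_COMPRESSION = PAGE);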
P.P.S.:
Oh, and another thing. You don't need your two-table solution to make this all work. Large binary sensor data should be of type VARBINARY(MAX) because its values can be stored "out of row" but still be a column in a single table (see the sp_tableoption documentation). You may want to consider normalizing some of your sensor data out of the binary data you have in the table, though, because your database won't be good for much beyond retrieving chunks of sensor data by time if you don't.
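A sketch of that option, once more against the hypothetical dbo.SensorData table:

    -- Force large value types (varbinary(max), varchar(max), ...) to be stored out of row
    EXEC sys.sp_tableoption
        @TableNamePattern = N'dbo.SensorData',
        @OptionName       = 'large value types out of row',
        @OptionValue      = 'ON';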
Best Answer
Since (a) the information you are working with appears to be, in and of itself, a very valuable organizational resource, and (b) the volume of data will be considerable, I would decidedly (c) build a relational database on one of the major SQL platforms.
That of course —from a very general perspective— requires three essential factors:
A clearly defined conceptual schema, in which one has to identify and mark out with precision the prototypes of things, i.e., the entity types (including their properties and interrelationships) of relevance in the business environment you are working with (e.g., the Towers and Sensors you mention).
As you know, this point entails establishing continuous and productive communication with business experts.
A logical layout that reflects the conceptual level with accuracy, by means of tables (i.e., mathematical relations) holding well-delimited columns with appropriate column names and types (i.e., relation attributes) and all the corresponding constraints to ensure that the data complies with all the rules determined at the previous tier.
Therefore, it is here where the vast power of the relational model comes into play (though its advantages have positive repercussions at other levels of abstraction).
A physical arrangement that, e.g., boosts the execution speed of the —dynamic— logical data manipulation operations and guarantees the logical constraints.
Since the relational model offers physical data independence, a database management system (DBMS for brevity) can provide any kind of structure at this level, not exclusively indexes, to support the logical definitions. In the case of the leading SQL platforms, yes, this commonly implies, precisely, setting up an indexing strategy based on the database-specific query tendencies, and you brought up very interesting considerations with respect to some possible configurations; but, without knowing the particular informational necessities with exactitude, offering specific advice in this regard would not be appropriate.
Other elements that deserve evaluation are, e.g., upgrading network infrastructure to increase bandwidth, enabling proper server configurations (hardware- and software-wise), etc. And, if, and only if, a practitioner is qualified enough, he or she could even modify the source code of the DBMS of choice (more feasible in open source environments, naturally).
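Purely as an illustration of the second and third factors above (not a recommendation of any specific design), a simplified sketch of such a logical layout for the Towers and Sensors you mention could look like the following; every table, column and constraint name is hypothetical and would have to be derived from the actual conceptual schema agreed with the business experts:

    CREATE TABLE Tower (
        TowerNumber int          NOT NULL,
        Location    varchar(100) NOT NULL,
        CONSTRAINT Tower_PK PRIMARY KEY (TowerNumber)
    );

    CREATE TABLE Sensor (
        SensorNumber  int  NOT NULL,
        TowerNumber   int  NOT NULL,
        InstalledDate date NOT NULL,
        CONSTRAINT Sensor_PK PRIMARY KEY (SensorNumber),
        CONSTRAINT Sensor_to_Tower_FK FOREIGN KEY (TowerNumber)
            REFERENCES Tower (TowerNumber)
    );

    CREATE TABLE SensorReading (
        SensorNumber int           NOT NULL,
        ReadingDate  date          NOT NULL,
        ReadingTime  time          NOT NULL,
        Value        decimal(10,4) NOT NULL,
        CONSTRAINT SensorReading_PK PRIMARY KEY (SensorNumber, ReadingDate, ReadingTime),
        CONSTRAINT SensorReading_to_Sensor_FK FOREIGN KEY (SensorNumber)
            REFERENCES Sensor (SensorNumber)
    );

    -- Physical-level supplement: one possible index to support date-range retrievals;
    -- the right indexing strategy depends on the actual query tendencies
    CREATE INDEX SensorReading_Date_IX ON SensorReading (ReadingDate, SensorNumber);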
In this way, the aspects that you highlight would be well-addressed, because you would easily be able to declare queries to, e.g., obtain information in very meaningful forms. For instance, you can get the data associated with the Sensor 1750, installed at the Tower identified by the TowerNumber 31, between the Date 1 June 2017 and the Date 27 June 2017.
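A sketch of such a retrieval, written against the hypothetical tables declared above:

    SELECT sr.SensorNumber,
           sr.ReadingDate,
           sr.ReadingTime,
           sr.Value
      FROM SensorReading sr
      JOIN Sensor s
        ON s.SensorNumber = sr.SensorNumber
     WHERE sr.SensorNumber = 1750
       AND s.TowerNumber   = 31
       AND sr.ReadingDate BETWEEN '2017-06-01' AND '2017-06-27';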
Furthermore, since (1) the data in a relational database is logically managed in terms of sets with the aid of operations based on the relational algebra, and (2) the different SQL engines are physically optimized (some more than others) for set processing, you can, e.g., retrieve, compare or summarize entire ranges of Sensor readings in single set-oriented operations rather than row by row.
The data manipulation possibilities are in fact huge (demonstrating the unmatched versatility of the relational paradigm), since you can work not only with base tables (the ones declared with CREATE TABLE … ( … ); statements) but also with derived ones (the ones expressed via SELECT …; operations, sometimes fixed as VIEWs). In other words, you can (i) express new data structures based on (ii) prior ones, operating on (iii) the single underlying relational construct, i.e., the mathematical relation.
Evidently, the arrangement of the base tables and columns of a relational database can evolve, and (a) new base tables or columns may be incorporated into it when (b) keeping track of new entity types or entity type properties is deemed worthwhile in the pertinent business context. In other words, neither the initial structure nor the opening constraints of a relational database are expected to be static or immutable. Besides, a database that is appropriately organized from the very beginning tends to be much easier to modify when new informational requirements arise.
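As a sketch of a derived table fixed as a VIEW (again over the hypothetical SensorReading table mentioned earlier), e.g., one row per Sensor and day with its aggregated readings:

    CREATE VIEW DailySensorSummary AS
    SELECT SensorNumber,
           ReadingDate,
           COUNT(*)   AS ReadingCount,
           MIN(Value) AS MinValue,
           MAX(Value) AS MaxValue,
           AVG(Value) AS AvgValue
      FROM SensorReading
     GROUP BY SensorNumber, ReadingDate;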
In agreement with the considerations above, the logical format of the applicable sets must be produced declaratively, at the database logical level. The graphical or presentational formatting of the sets (e.g., the colouring or font faces utilized) must in turn be processed by way of the code of one or more application programs (yes, mostly in a procedural manner, perhaps with the assistance of an object-oriented framework, HTML, etc.), in order to make the access and presentation of such sets user-friendly. Certainly, you could as well use reporting software that connects with your database.
The modeling of a database of relevance
Given that you will be working with Sensor data (which, among other features, typically involves information in the shape of time series), you may find helpful many of the database design and overall administration principles contained in the two exceptional answers, by @PerformanceDBA, to the questions entitled:
The Relational, Flat File and NoSQL approaches
The relational model by Dr. Edgar Frank Codd, although published in 1970, remains genuinely the most modern and elegant method (based on logic and set theory) to cope with the problem of data management. The distinct SQL DBMSs are, in turn, the most popular approximations (that, although not fully compliant, are nevertheless really powerful) to the systems proposed in the relational theory, and some of them have been heavily optimized (e.g., regarding their physical-level mechanisms) for even decades now. In addition, the main SQL platforms of course can (and will be able to) work with the most up-to-date storage (e.g., hard drives) and processing (e.g., CPUs) technologies quite efficiently.
When built on a powerful DBMS, a relational database that is properly designed at the conceptual, logical and physical levels would decidedly become a self-contained, self-descriptive and self-protective asset that (1) is trustworthy and (2) offers a fast response, two aspects that, as you know, are of great significance.
Flat files
Therefore, the suggestion of simply keeping the data in flat files is easily discarded, since the flat file approach falls well short of the (much more convenient) relational fashion, to say the least.
If, however, you opt for utilizing flat files, you should evaluate the employment of a robust utility like Awk which, although not a DBMS (and not designed as one), supplies handy resources to deal with files, records, fields, etc. See The GNU Awk User’s Guide for more information on this subject.
NoSQL
“Unstructured data” and associated terms
As per their propaganda, the initial justification for the usage of NoSQL DBMSs is that they are meant to be used in business domains that involve handling “unstructured data”, which calls for the question: what exactly is “unstructured data”?
In that respect, it must be said that data, by its very nature, is structured; if it had no structure, then it would be something meaningless; consequently, such a thing (i) could not be considered data and (ii) would not be worth administering. Hence, “unstructured data” is a contradictory and unfortunate expression.
Another phrase from those contexts is “semi-structured data”. That phrase suggests that there exists data that is structured “in part” or “in half”; so, in accordance with the previous paragraph, only the “part” or “half” that is structured can be actual data, while the remaining “part” or “half” is merely a shapeless thing that lacks structure and cannot be referred to as data.
Yet another, alas, typical term that is found in NoSQL marketing is “polymorphic data”. If said term signifies something like “data that has many different forms”, then it is in fact ordinary data, which has always come about in many distinct forms. And since it has many different forms, it presents many distinct structures, so there is nothing special about this “kind” of data.
Needless to say, data and data structures have always been susceptible to change, so there is nothing unusual in this regard either.
Data volume growth
Evidently, the volumes of information managed by means of computerized systems have definitely grown over the years —and will keep growing exponentially as time passes, because new systems are being built every day—, but that is a fact that does not have to do with the structure of the information per se.
Lack of a rounded theoretical foundation
A critical limitation of NoSQL systems (there are different classes, e.g., document- and graph-based ones) is that none of the current products (although heavily marketed and labeled as “modern”) possesses a sound theoretical basis, if any at all, that supports each and every one of the three most important elements of a proper DBMS, i.e., tools for data (a) definition, (b) manipulation, and (c) constraint enforcement. Thus, the NoSQL approach actually suggests a regression to an ancient era in which the handling of data was performed in an ad hoc and unsound manner, with all the needless complexity that entails.
Nowadays, graph systems are included within the “NoSQL” spectrum. These software products invite managing data by virtue of operations on two distinct structures, nodes and relationships (which, once again, is in conflict with the term “unstructured data”), and they stand out in the “NoSQL” group because they do have a mathematical foundation. However, graph products are rather similar to network platforms, which have been deemed obsolete for decades (an obvious drawback is that, as suggested above, they need two structures for data representation, while a relational DBMS, as per the information principle, needs only one).
Even if the creation of the different NoSQL systems is chronologically more recent than the origins of the majority of the SQL DBMSs, most of the concepts on which the NoSQL products are based are, in effect, primitive.
A NoSQL program should be employed, mostly, in scenarios where, e.g., the structure of the data at issue cannot be delineated before the creation of the concerning systems. But, even in such scenarios, one or more people will necessarily have to work out that structure after the database and app(s) have entered the production stage, in order to get the most out of all the resources invested in a project; thus, data structure delineation is a task that cannot be bypassed, and it has to be done sooner or later.
So, while going the NoSQL way is a possibility, all the factors previously mentioned should definitely be taken into account.
The most robust method
In contrast, approaching the informational requirements of a business environment in a relational manner —i.e., with a general paradigm behind— offers the possibilities of (1) managing the data in its natural structure from the beginning —which eases integration with other data sources— and also of (2) producing new trustworthy structures by dint of the manipulation of a single instrument —as explained in previous sections— that has a robust scientific basis.
According to your description of the scenario in question, you have already identified a particular structure in terms of the relevant organizational needs, so I suggest requesting that the business domain experts validate it. Subsequently, I recommend taking advantage of (i) the constructs (the relation, the constraints and the operations) provided by the relational model to handle said structure and the respective data, and of (ii) your SQL DBMS of choice, which will very likely offer very efficient physical tools that can support the present demands and supply future scalability.