Read-Only Database – Normalize or Not for Best Query Performance

database-design, query-performance

I have a pandas DataFrame that looks a bit like this:

          id         name        date      col1       col2  total
0  123456748  EXAMPLENAME  2020-09-01  SOMEDATA   MOREDATA   5.99
1  123456748  EXAMPLENAME  2020-09-01  SOMEDATA  OTHERDATA  25.99

There are 15 columns; the name values are associated with the ID, and the rest is data relevant to that person. col2 has about 400 unique values. The database would start at about 300,000,000 rows and then grow by about 500,000 records per week.

Records in the database will never be updated or deleted; only new ones will be added. The final purpose of the database is a web app in which the user selects the ID of a person, the database retrieves that person's information, and the website renders a graph and a dataframe. The expected traffic of the website is very low, so I was thinking about using SQLite.
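
For context, the lookup I have in mind is roughly the following (the payments.db file and the payments table are just placeholder names for now, and SQLite is only one of the options I'm considering):

    import sqlite3
    import pandas as pd

    # Open the database (read-only in my case) and pull every row for one person.
    conn = sqlite3.connect("payments.db")

    person_id = 123456748
    df = pd.read_sql_query(
        "SELECT * FROM payments WHERE id = ?",  # parameterized lookup by person ID
        conn,
        params=(person_id,),
    )
    conn.close()

    # df then feeds the graph and the table shown in the web app.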

Based on that, I have two questions:

  1. Should I use a relational database, like PostgreSQL or SQLite, or should I try MongoDB? I'm interested in the performance of the database when selecting and retrieving data; I don't care too much about insert time, as inserts won't happen very often (once per week).
  2. In terms of query performance, if a relational database is the right choice, should I keep all the data in one table or should I split it up (normalize it)? I have read that normalizing a database whose only purpose is to store and query data can lead to worse performance than keeping everything in one table. However, I don't know much about databases and would prefer an expert opinion, or resources to learn more about correct implementation and maintenance.

Thanks.

Best Answer

All things being equal, a normalized database will provide better performance for a wide range of queries. It is a myth that a normalized database will be slow, especially in an analytic environment or with a large number of rows. Usually if performance suffers, it is because a database has not been properly normalized.

A normalized database also provides better data integrity, which, depending on your industry and whether or not you are storing data related to payments, may be necessary.

However, if you are just dumping data, not performing any transforms, always returning the same kind of data, and always using the same access path, then you probably don't need to spend as much time normalizing the data to meet your requirements.
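
As a rough sketch of the two layouts being discussed, using SQLite since it is one of your candidates (all table and column names below are assumptions taken from your sample data):

    import sqlite3

    conn = sqlite3.connect("payments.db")

    # Option A: everything in one wide table. Simplest to load, but the name is
    # repeated on every one of the ~300M rows.
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS payments_flat (
        id    INTEGER,
        name  TEXT,
        date  TEXT,
        col1  TEXT,
        col2  TEXT,
        total REAL
    );
    """)

    # Option B: normalized. Person attributes are stored once; the fact rows
    # reference them. (SQLite only enforces REFERENCES when PRAGMA foreign_keys = ON.)
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS payment (
        person_id INTEGER NOT NULL REFERENCES person(id),
        date      TEXT    NOT NULL,
        col1      TEXT,
        col2      TEXT,
        total     REAL
    );
    """)
    conn.close()

With roughly 400 distinct values, col2 would be another candidate for its own lookup table, but the same idea applies.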

But it would still make sense to use a relational engine that features B-Trees (and preferably clustered indexes/index-organized tables) to speed up data retrieval. Some NoSQL engines support methods to quickly access single records (hash maps, occasionally B-Trees), but they won't offer the flexibility of a good RDBMS (not SQLite).
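
In SQLite terms that could look like the following (again just a sketch with assumed names; other engines spell it differently, e.g. index-organized tables in Oracle or clustered indexes in SQL Server):

    import sqlite3

    conn = sqlite3.connect("payments.db")

    # A plain B-Tree index on the lookup column turns "WHERE id = ?" into a seek
    # instead of a scan over all ~300M rows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_payments_flat_id ON payments_flat (id);")

    # Closer to a clustered index / index-organized table: a WITHOUT ROWID table
    # stores the rows themselves in primary-key order, so all rows for one id
    # sit together on disk.
    conn.execute("""
    CREATE TABLE IF NOT EXISTS payments_clustered (
        id    INTEGER,
        date  TEXT,
        col1  TEXT,
        col2  TEXT,
        total REAL,
        seq   INTEGER,                 -- tie-breaker to keep the key unique
        PRIMARY KEY (id, date, seq)
    ) WITHOUT ROWID;
    """)
    conn.commit()
    conn.close()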

Another thing to consider is staying power/support. Many database engines have existed in a mature form for 30+ years and have existing tools to assist with backups/ETL/maintenance, a knowledge base, developers experienced in their use, and some assurance that the worst bugs have been worked out (MongoDB was awful in that regard early on).

From comment:

I am storing payments. Why does a normalized database provide better data integrity?

It depends on whether you are handling payments in real time or just storing an archived history, but generally a relational database is preferable because:

  1. Transactions ensure that complex operations are performed in a way that keeps the database consistent.
  2. All operations against the database are logged, so changes made between backups are auditable.
  3. With the right primary key definition, uniqueness is guaranteed and all totals/calculations performed will be consistent.
  4. With the right clustered index and/or partitioning scheme, aggregation is very fast (see the sketch after this list).
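
A minimal sketch of points 1, 3 and 4 in SQLite (still with the assumed names from above; a server RDBMS such as PostgreSQL would look very similar):

    import sqlite3

    conn = sqlite3.connect("payments.db")

    # 1. A transaction: either both rows are stored, or neither is.
    #    (The connection used as a context manager commits on success and rolls
    #    back if an exception is raised.)
    with conn:
        conn.execute(
            "INSERT INTO payment (person_id, date, col1, col2, total) VALUES (?, ?, ?, ?, ?)",
            (123456748, "2020-09-01", "SOMEDATA", "MOREDATA", 5.99),
        )
        conn.execute(
            "INSERT INTO payment (person_id, date, col1, col2, total) VALUES (?, ?, ?, ?, ?)",
            (123456748, "2020-09-01", "SOMEDATA", "OTHERDATA", 25.99),
        )

    # 3. A primary key guarantees uniqueness: inserting the same id into person
    #    twice raises sqlite3.IntegrityError instead of silently double-counting.

    # 4. With rows for one person stored together, per-person totals are a fast aggregate.
    totals = conn.execute(
        "SELECT person_id, SUM(total) FROM payment GROUP BY person_id"
    ).fetchall()
    conn.close()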