MySQL – How to structure a DB where the main records are large

mongodb · MySQL · nosql

I was wondering if I could ask for some quick opinions.

I am prototyping an app which seems simple but I can’t readily arrive at an architecture that feels right.

We have data that is simple and fits well into a standard SQL format, e.g. users, organisations, projects, reports.

However, the 'reports' are the problem. They typically contain between 1,000 and 100,000 'line items', each with around 10 fields. It's a bit like a large spreadsheet, and in fact that is usually where they originate.

A user (let's say there are 100 of them) generates say 10 projects a year, each project has say 10 reports, and each report has say 10,000 items (or rows). That means one massive table quickly growing to millions of rows, with each row 'belonging' to a report, etc. .. well, this doesn't feel right.

To make this trickier, the reports can almost be considered 'mini-databases' in themselves: although the meat of the data is thousands of 'line items' (with uniform column requirements), they are structured into 'zones', 'sections' and 'subsections', and there is also an 'overview/meta' section.

In theory, I could normalize this right down from 'user' to 'item', something like users->projects->reports->zones->sections->subsections->items … but again this just doesn't feel right to me; what the DB has to do simply to go and get a report seems like overkill. All of this report data would also be mashed together across all the different user accounts.
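To make that concrete, here is a rough sketch of what the normalized chain could look like (table and column names are just placeholders I'm imagining, and I've collapsed zones/sections/subsections into one level for brevity):

```sql
-- Rough sketch only; names and columns are placeholders.
CREATE TABLE reports (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    project_id BIGINT UNSIGNED NOT NULL,    -- projects in turn belong to users/organisations
    title      VARCHAR(255) NOT NULL,
    meta       JSON NULL                    -- the 'overview/meta' section
);

CREATE TABLE report_sections (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    report_id  BIGINT UNSIGNED NOT NULL,
    zone       VARCHAR(100) NOT NULL,
    section    VARCHAR(100) NOT NULL,
    subsection VARCHAR(100) NULL,
    FOREIGN KEY (report_id) REFERENCES reports (id)
);

CREATE TABLE line_items (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    section_id  BIGINT UNSIGNED NOT NULL,
    quantity    DECIMAL(12,2) NULL,         -- stand-ins for the ~10 real fields
    description VARCHAR(255) NULL,
    FOREIGN KEY (section_id) REFERENCES report_sections (id)
);
```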

Every report query would have to go and collect thousands of rows out of millions, and assemble the client JSON from several tables, and there would be no other real benefit to being able to query the table this way. All the reports are really self-contained.

As such, I am wondering whether this is a case for a NoSQL route, like Mongo or DocumentDB, where I can just throw thousands of lines of JSON into a Report collection or similar. This would have further implications though, as the backend frameworks I like to use don't play all that nicely with NoSQL, and we lose a bit of the standard, easy relational schema that would still seem to fit the other tables of the app (e.g. user accounts, RBAC, users belonging to organisations, etc.).

.. but there would be a similar question with a JSON store: is it better to maintain a 'tree' structure, keeping all the line items within their sections and the sections within their zones, etc., or to normalize out so there is one big collection of line items that carry references to their sections, etc.?

I've messed around with using MySQL for everything and using the JSON field type to store each report as one big JSON document, but I don't know whether this is pushing the intention of the field type a little, and it's hard to query into.
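For reference, that experiment looked roughly like this (names made up); getting a specific line item back out is possible, but it means path expressions rather than plain columns:

```sql
-- Rough sketch of the 'one big JSON per report' experiment; names are made up.
CREATE TABLE reports_json (
    id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    body JSON NOT NULL                     -- whole report: zones -> sections -> items
);

-- Pulling one nested value back out takes a JSON path rather than a column:
SELECT JSON_EXTRACT(body, '$.zones[0].sections[0].items[5].quantity') AS quantity
FROM   reports_json
WHERE  id = 42;
```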

A consideration here is that on the frontend, users will update one (or a few) row(s) at a time, and I need to be able to handle that. In this way, I suppose there's a similarity with Google Sheets or Airtable or something. Do these big apps tend to use an RDBMS?
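To illustrate the kind of edit I mean, using the made-up tables from above, a single-cell change would look something like one of these:

```sql
-- Editing one line item when items are ordinary rows:
UPDATE line_items
SET    quantity = 12
WHERE  id = 98765;

-- Editing the same item when the whole report lives in one JSON document:
UPDATE reports_json
SET    body = JSON_SET(body, '$.zones[0].sections[0].items[5].quantity', 12)
WHERE  id = 42;
```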

I notice there seem to be more and more database options appearing all the time now, e.g. Fauna, Supabase, etc., which makes it even trickier to choose the right solution.

Thanks in advance for any tips

Best Answer

NoSQL is not a direct solution to an amount-of-data problem; rather, it is mainly a solution for a schema-less or highly variable schema problem (as you've experienced when you said "we lose a bit of the standard easy relational schema" regarding NoSQL). Because you admit your data is highly relational and fits a standard schema, you'll likely want to continue using an RDBMS like MySQL.

That being said, millions of rows in a table is on the small side and certainly no concern for a relational database when architected and indexed appropriately. If I play out the calculation with your example's numbers above (100 users × 10 projects × 10 reports × 10,000 items), it seems like we're actually talking about 100 million records per year, which is what I would subjectively consider a medium-sized table. The fact that the table is only about 10 columns wide helps keep it compact as well. This is still no problem for any modern RDBMS. I've personally worked with larger tables in the 10s of billions of records, about 50 columns wide, and regular B-Tree indexing still works amazingly well even at that scale. I could query thousands of rows from that table in a second or less, and millions of rows in about 10 seconds, on very modest hardware.
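As a rough illustration (table and column names assumed from your description, not something you posted), a plain B-Tree index that scopes items to their report means fetching one report's ~10,000 rows never touches the other ~100 million:

```sql
-- Assumed names; the point is the index, not the exact schema.
CREATE INDEX idx_line_items_report ON line_items (report_id);

SELECT *
FROM   line_items
WHERE  report_id = 42;   -- index range scan over ~10,000 rows, not a scan of 100 million
```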

Aside from appropriately indexing and architecting your database, you can use features like partitioning (sketched below), and routinely archive older, less frequently queried data into a separate table. Long story short, it sounds like you're architecting things correctly, and I wouldn't stress too much about the amount of data you plan to support until it becomes problematic (which I don't think it will).
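In case it helps, here is what the partitioning idea could look like (again with assumed names): range-partitioning the items table by year keeps older data in separate partitions that can be archived or dropped cheaply.

```sql
-- Hypothetical sketch: partition line items by year of creation.
CREATE TABLE line_items (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    report_id  BIGINT UNSIGNED NOT NULL,
    created_at DATE NOT NULL,
    -- ... the ~10 real data columns would go here ...
    PRIMARY KEY (id, created_at)           -- the partitioning column must be part of the key
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```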