Mysql – Store time series data efficiently without wasting space

database-designMySQLtime

I'm building an application to calculate different KPI metrics for customers of an ecommerce website e.g. (avg. order value, avg. items count and so on). KPIs are integer and or double values e.g. number of items bought, avg. order value, gross margin…

The application fetches orders data, calculate metrics and store them.
I'm using MySQL as a relational database.

about metrics:

I currently have 10 metrics to calculate for each customer.

Metrics can increase in the future but not so frequently so I can consider "10" as quite definitive. Anyway altering schema in the future is not a problem at all.

I need to calculate each metric on a weekly basis (at minimum).
Metrics are about customers.

about customers:

Customers are 30k and they are growing + 0.5k/month rate.

Not all customers buy with the same frequency. I can have occasional buyers but also heavy buyers.

I want to show a graph with the overall trend of a specific KPI in a given timespan.

I want to show a graph with the trend of a metric for a specific customer in a given timespan.

My entites are:

  • orders
  • customers
  • customers_kpi

I'm worried about storing a huge amount of useless data

52 weeks * 30k users * 4+ years = 6.2M rows at minimum

I have 2 questions:

  1. Should I store rows for customers without orders for a given timespan (e.g. the row will be all filled with NULL)? Can avoid it somehow without affecting data visualization?

  2. Which table structure is more efficient ("thin" table vs "fat" table) given that not all customers buy every week and number of metrics are unlikely to change often?

I'm in doubt among these 2 structures for customers_kpi table:

customer_id kpi1 kpi2 kpi3 ..kpiN from to

VS

customer_id kpi_name kpi_value from to

Best Answer

First off, a few million rows is nothing to be afraid of. Modern relational database management systems can handle up to trillions of rows on pretty standard hardware, when architected and indexed properly. Secondly, I like to measure things by a minimum of 10 years as a good test of time, so let's take your metric and annualize it over 10 years which gives us closer to ~15.6 million rows, and let's round it up to say 20 million (to account for customer growth etc). We're still in medium sized numbers, as far as number of records in a single table, so no big deal.

To answer your questions directly:

1. "Should I store rows for customers without orders for a given timespan (e.g. the row will be all filled with NULL)? Can avoid it somehow without affecting data visualization?"

A: There's no need to store empty rows. You can create a dates dimension table which will store a single row per date and then you can outer join to it such as SELECT * FROM dateDimensions AS D LEFT OUTER JOIN orders AS O ON D.date = O.orderDate WHERE D.date >= 'some date value' AND D.date < 'some other date value'. (You can join in your other tables here as needed.) This will help keep your table much leaner (maybe that 20 million rows in 10 years becomes 10 million or 5 million roughly) because now you're not storing multiple empty order or empty customer_kpi rows every week for every customer who didn't make an order, you're only storing one row per date (or your dates dimensions table can even be simplified to just storing one row per week - though probably overkill). And outer joining to your dates dimension table will maintain the same visualization you're looking for.

2. Which table structure is more efficient ("thin" table vs "fat" table) given that not all customers buy every week and number of metrics are unlikely to change often?

A: Both table structures have their place, and sometimes a denormalized table structure like your first example structure is better for an OLAP heavy / heavy reporting database regarding performance, but in general normalization is best practice when all things are equal, like your second example structure. Because 20 million rows (worse case number of rows based on my answer to your first question) is really nothing to be concerned about in a single table, and because you get better flexibility with querying about only specific KPIs (which can be more performant by not bringing back a fat row with a bunch of unneeded data), e.g. SELECT * FROM customers_kpi WHERE kpi_name = 'SomeSpecificKPI', I'd personally recommend starting with your second table structure. You can always easily transform it later on to look like your first structure if you need or should you change your mind and want to materialize it as that structure into a new table.