MySQL – Is normalization advisable on transactional data?

Tags: insert, MySQL, normalization, performance, reporting

I have read a lot of articles about normalized vs. denormalized schemas, but I find very little mention of the cost of row insertion. I am considering the impact of converting a 100% denormalized database system into a more normalized one. Below is the info that I believe is pertinent to my question, which in essence is: can I insert historical transaction data into normalized tables as efficiently as I can into a single denormalized table?

  • I'm using MySQL; all tables are MyISAM
  • The system can currently be described as 95% inserts, 5% selects, and 0% updates. (This is sales transaction data, and therefore immutable.)
  • I'd like to introduce a large number of statistical features to the app, so the number of selects will go way up, but it will still be mostly inserts.

At the time of record insertion, the data consists of:

storeid - int (we track multiple stores)
itemname - varchar(64)
customeruuid - char(36)
customername - varchar(64)
amount - int (this currency uses integer values only, e.g. tokens)
location - varchar(64)

My denormalized sales table is, quite predictably, exactly the same as the above data, with the addition of a date field that gets set to NOW() in the insert query. So right now my inserts are brain-dead simple and consist of a single SQL statement.
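
For illustration, that single statement looks roughly like this (I'm calling the current table sales here; the real name and the parameter binding are simplified away):

INSERT INTO sales
    (storeid, itemname, customeruuid, customername, amount, location, date)
VALUES
    (?, ?, ?, ?, ?, ?, NOW());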

However, I would like to start offering a lot more statistical features in the application, requiring a lot of aggregations. So that makes me consider normalization …

My first swipe at a normalized structure would be something like:

customers
----------
id - int, PK, AI
uuid - char(36)
name - varchar(64)

products
----------
id - int, PK, AI
storeid - int
name - varchar(64)
(with storeid + name being a composite unique index)

locations (a location is akin to a city)
-----------
id - int, PK, AI
name - varchar(64)

sales
-----------
id - int, PK, AI (needed elsewhere in the system)
customerid - int (fk to customers table)
productid - int (fk to products table)
amount
locationid - int (fk to locations table)
date
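
In DDL terms I picture roughly the following. The unique keys on customers.uuid and locations.name are my own assumptions (the lookup inserts discussed below would need them), and since MyISAM doesn't enforce foreign keys, the "fk" columns are really just indexed INTs:

CREATE TABLE customers (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    uuid CHAR(36)     NOT NULL,
    name VARCHAR(64)  NOT NULL,
    UNIQUE KEY (uuid)
) ENGINE=MyISAM;

CREATE TABLE products (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    storeid INT UNSIGNED NOT NULL,
    name    VARCHAR(64)  NOT NULL,
    UNIQUE KEY (storeid, name)
) ENGINE=MyISAM;

CREATE TABLE locations (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64)  NOT NULL,
    UNIQUE KEY (name)
) ENGINE=MyISAM;

CREATE TABLE sales (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    customerid INT UNSIGNED NOT NULL,   -- "fk" to customers.id
    productid  INT UNSIGNED NOT NULL,   -- "fk" to products.id
    amount     INT          NOT NULL,
    locationid INT UNSIGNED NOT NULL,   -- "fk" to locations.id
    date       DATETIME     NOT NULL,   -- or DATE, if time of day isn't needed
    KEY (customerid), KEY (productid), KEY (locationid), KEY (date)
) ENGINE=MyISAM;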

Very simple stuff so far. But here's where I start to lose the plot on the insertion process.

Customers, products, and locations are not known ahead of time. So a transaction may be introducing these values to the system for the very first time, requiring an insert into the appropriate lookup table.

So, do I really need to do three "insert … on duplicate key update" operations to get the customer, product, and location IDs for each insert into the sales table? Is there some shortcut I'm missing here?
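
To make that concrete, I imagine each sale turning into something like the following (leaning on MySQL's documented LAST_INSERT_ID(expr) behaviour inside ON DUPLICATE KEY UPDATE, so the existing id comes back when the row is already there):

-- one upsert per lookup table, then the fact insert
INSERT INTO customers (uuid, name) VALUES (?, ?)
    ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SET @customerid = LAST_INSERT_ID();

INSERT INTO products (storeid, name) VALUES (?, ?)
    ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SET @productid = LAST_INSERT_ID();

INSERT INTO locations (name) VALUES (?)
    ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SET @locationid = LAST_INSERT_ID();

INSERT INTO sales (customerid, productid, amount, locationid, date)
VALUES (@customerid, @productid, ?, @locationid, NOW());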

The next obvious question is: "is normalizing the right thing to do in this scenario?"

All reporting will be done with a single store id, and a date range. Example queries are:

  • How many copies of item A sold across all locations?
  • What is the total sales amount per location?
  • What are the top 10 most popular items?
  • Which 10 items generated the most revenue?

Would these queries be more efficient against normalized data, even though they would require joins?
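
For instance, I'd expect the "top 10 most popular items" report to turn into something like this (the parameters being the store id and the date range):

SELECT p.name, COUNT(*) AS units_sold
FROM sales s
JOIN products p ON p.id = s.productid
WHERE p.storeid = ?
  AND s.date >= ? AND s.date < ?
GROUP BY p.id, p.name
ORDER BY units_sold DESC
LIMIT 10;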

Other possibilities I've considered, but not sure how to weigh:

  • Could proper indexing on the denormalized sales table yield better SELECT results than normalized tables could provide?
  • Since we're talking about immutable data, would it be a good approach to use pre-calculated aggregate data for previous months? I picture just-in-time generation: if the summary for a closed month doesn't exist yet, compute it and save it off, so that a page refresh or subsequent report loads the summary data instead of re-running summary queries (something like the sketch after this list). This seems a bit complicated and adds a lot of application logic, but it seems friendlier to the DB.
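
Concretely, something like this, with the table and column names invented purely for illustration:

CREATE TABLE monthly_item_sales (
    storeid   INT UNSIGNED    NOT NULL,
    yearmonth CHAR(7)         NOT NULL,   -- e.g. '2013-05'
    productid INT UNSIGNED    NOT NULL,
    units     INT UNSIGNED    NOT NULL,
    revenue   BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (storeid, yearmonth, productid)
) ENGINE=MyISAM;

-- built once, the first time a report asks for a closed month
INSERT INTO monthly_item_sales (storeid, yearmonth, productid, units, revenue)
SELECT p.storeid, DATE_FORMAT(s.date, '%Y-%m'), s.productid, COUNT(*), SUM(s.amount)
FROM sales s
JOIN products p ON p.id = s.productid
WHERE p.storeid = ? AND s.date >= ? AND s.date < ?
GROUP BY p.storeid, DATE_FORMAT(s.date, '%Y-%m'), s.productid;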

Lastly, some scale info to put this all into perspective. The DB currently contains about 8 million rows, with a current insertion rate of approximately 125k rows per month, from many thousands of customers and many thousands of products. There are several hundred stores; each store equates to one person, and reporting would typically be done only by that one person, at a completely unpredictable frequency (they may load their account 10 times a day, once a year, or almost never).

Sorry this is hella long. I may get slapped by the mods, so apologies ahead of time. This is a complex system, and I've already simplified it a lot in this post!

Thanks for any advice!

Best Answer

You can definitely keep all your dimensions and measures in one fact table and not use any dimension tables. Make sure your OLAP tool supports this though.

Normalizing out your dimensions into other tables is done mostly to minimize the size of the fact table, which can get large fast.

With no dimension tables you're looking at about 336 MB per year (not counting indexes), which isn't so bad.

With dimension tables, you're looking at about 34 MB per year, plus a couple dozen MB for storing dimension details. Indexes will be smaller too.
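
Back-of-the-envelope, in case it helps: ~125k rows/month is roughly 1.5M rows/year. A denormalized row carries three VARCHAR(64)s, a CHAR(36), two INTs and a date, i.e. a couple of hundred bytes each, which is where the hundreds of MB come from; a normalized fact row is five 4-byte INTs plus a 3-byte DATE, about 23 bytes, hence the tens of MB.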

You'll want to expand your date column into something more analyzable (year, month, quarter, etc.), which will add to the size.
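
A minimal sketch of that, assuming you simply add derived columns to the fact table (a separate date dimension table is the more traditional star-schema route; these column names are invented):

ALTER TABLE sales
    ADD COLUMN sale_year    SMALLINT UNSIGNED NOT NULL,
    ADD COLUMN sale_quarter TINYINT UNSIGNED  NOT NULL,
    ADD COLUMN sale_month   TINYINT UNSIGNED  NOT NULL;

-- filled at insert time from the transaction date, e.g.
--   sale_year = YEAR(NOW()), sale_quarter = QUARTER(NOW()), sale_month = MONTH(NOW())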

You'll want to index all the fields. Drop (or disable) the indexes before a bulk insert and add them back afterwards.
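
With MyISAM you don't even have to physically drop and re-create them; for a batch load, something along these lines works (note that DISABLE KEYS only defers non-unique indexes):

ALTER TABLE sales DISABLE KEYS;   -- skip non-unique index maintenance during the load
-- ... bulk INSERTs or LOAD DATA INFILE ...
ALTER TABLE sales ENABLE KEYS;    -- rebuild the indexes in one pass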

You can use a tool like Pentaho Aggregation Designer to find useful aggregates and generate them for you.