MySQL – Storing vs Calculating Aggregate Values

aggregatedatabase-designMySQL

Are there any guidelines or rules of thumb to determine when to store aggregate values and when to calculate them on the fly?

For example, suppose I have widgets which users can rate (see schema below). Each time I display a widget I could calculate the average user rating from the Ratings table. Alternatively I could store the average rating on the Widget table. This would save me from having to calculate the rating every time I display the widget, but then I'd have to recalculate the average rating each time a user rated a widget.

Ratings       Widgets
---------     -------
widget_id     widget_id
user_id       name              
rating        avg_rating  <--- The column in question

Best Answer

It depends. Pre-calculating aggregate values places a larger load on writes, deriving them makes reads more difficult

If you are frequently accessing a derived value, pre-calculation is a valid de-normalization step. However, in this instance, I recommend using a Materialized View (a view, written to disk, linked by trigger to the parent tables). The materialized view is designed to store frequently asked but tedious-to-derive data, and is useful for high numbers of writes and low numbers of reads.

In a high-write, high-read scenario, consider having a task in the background which mimics the effects of a materialized view, but in less than real-time. This will present a "good enough" average while preserving write and read performance.

In no circumstances, should you treat the derived column like a "normal" column: make sure the data presented in the Widgets "view" is present elsewhere in the table, such that the entire tuple can be derived by whatever processes you emplace. This question is also strongly database (and database-version) specific, so I recommend performance testing of the aggregate (with appropriate indexes) against a normal-sized data set and the materialized view.

Related Solutions

Sql-server – Performance effect of indexing foreign keys

It may still be useful to have indexes on the larger table to help the optimizer for specific SELECT queries, even if you will "never" change any of the values.

However, without a lot more knowledge of your system and the types of queries you will run, it's pretty difficult for any of us to able to tell you whether you should have indexes there or not. I'm afraid it's just something you're going to have to test.

Some background in this great article by Erin Stellato:

http://www.sqlperformance.com/2012/11/t-sql-queries/benefits-indexing-foreign-keys

You also might consider avoiding the joins in some cases - when you are sure the values will never change, and eliminating the join can simplify the execution plan, you can consider just inline constants, e.g.

SELECT CASE RatingID
  WHEN 1 THEN 'poor'
  WHEN 2 THEN 'average'
  ... 
END
FROM dbo.BiggerTable
...

Also I suggest never naming something ID. If it's a RatingID, call it RatingID everywhere it exists in the model.

Mysql – Storing dynamic data in NoSQL

I'll answer your question in an orthodox manner, with a twist of heresy:

Orthodoxy: You shouldn't store data in a column in a relational database that isn't "atomic."

Heresy: In your specific situation, you could -- maybe -- consider this blob of JSON to be an atomic object.

Years ago, Chris Date said it like this:

"A relation is said to be in first normal form (abbreviated 1NF) if and only if it satisfies the condition that it contains scalar values only"

^{Date, C.J. An Introduction to Database Systems, 6th edition (Addison-Wesley, 1995)}

Later, he took a somewhat softer stance:

"1NF just means each tuple in the relation contains exactly one value, of the appropriate type, for each attribute. Observe in particular that 1NF places no limitations on what those attribute types are allowed to be."

^{Date, C. J. Database Design and Relational Theory: Normal Forms and All That Jazz (OReilly Media, 2012)}

The "exactly one value" I'm arguing for here is "exactly one JSON object" (which could, in turn contain a JSON array).

Storing things in JSON in a column is a bad idea if you need to the DBMS to manipulate it in any way, since, of course, it can't be properly indexed like properly normalized data can be... but, arguably, if you really really really don't intend for the DBMS to do anything with what you're storing other than write and read it, the case could be made to store a chunk of JSON in a single column, claiming the JSON array of values to be a single atomic value.

The big objection, I think, to doing this, is when it's done out of a lack of familiarity with the relational model or out of laziness or naivete. Obviously, there are a lot of ways it could be done wrong, but I'd suggest that there's nothing inherently wrong about storing a chunk of JSON in a database, As Long As You Know What You're Doing.™

And, of course, you could use a MySQL FULLTEXT index on it, now that those are supported in InnoDB as of (MySQL 5.6).

Best Answer

Related Solutions

Sql-server – Performance effect of indexing foreign keys

Mysql – Storing dynamic data in NoSQL

Related Question