Mysql – SQL aggregate function performance in a table with many records that is well-indexed

innodbjoin;MySQLperformanceperformance-tuning

My question is about 3 tables that stores data of a single entity, and there is a report generation feature that refers these tables and calculate COUNT and AVG information.

There are three tables: review, rating, and rating_values, all three using MySQL InnoDB.

Review table

CREATE TABLE IF NOT EXISTS `review` (
  `review_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Unique review ID of this review.',
  `nid` int(10) unsigned NOT NULL,
  `sub_target` int(10) unsigned NOT NULL,
  `timestamp` int(10) unsigned NOT NULL DEFAULT '0',
  `score` float DEFAULT NULL,
  PRIMARY KEY (`review_id`),
  KEY `nid` (`nid`),
  KEY `uid` (`uid`),
  KEY `nid_uid` (`nid`,`uid`),
  KEY `score` (`score`),
  KEY `timestamp` (`timestamp`),
);

Review table acts as the the primary table. Each review can have 0 or more (we have a limit of 18 at the moment in application level) rating values. A rating value is something like "Food", "Location", etc for a restaurant review.

Rating table

'rating' (
  `review_id` int(10) unsigned NOT NULL, //foreign key to {review}.review_id
  `type` varchar(255) NOT NULL,
  `score` float DEFAULT NULL,
  KEY `review_id` (`review_id`),
  KEY `score` (`score`)
);

In this table, type is a machine name of the "food", "location", etc I mentioned above. There is a foreign key to the base table that stores these rating type information.
In our application, each rating can have multiple rating values that count towards the main rating. Say, there is "food", and there can multiple sub questions such as "food – appearance", "food – taste". These values can be 0 or more. If there are more than 1 sub question, this score field contains the average of the sub questions (please see the table below).

Rating value table

'rating_value' (
  `review_id` int(10) unsigned NOT NULL,// foreign key to the {review}.review_id
  `type` varchar(255) NOT NULL,
  `sid` int(10) unsigned DEFAULT NULL,
  `score` int(10) unsigned DEFAULT NULL,
  KEY `review_id` (`review_id`),
  KEY `type` (`type`)
)

This is where I expect most of the data would go into. Each sub question has an sid that is referred here.


A typical review would add rows like this:

Review:

review_id | nid | sub_target | timestamp | score
1         | 583 |         786| 1388167200| 4.32

Rating (please note how the two floats add up to 4.32 above):

review_id|type     |score
1        |food     |2.10
1        |location |2.22

Rating value:

review_id|type     |sid|score
1        |food     |1  |2.00
1        |food     |1  |2.30
1        |location |2  |2.12
1        |location |3  |2.22
1        |location |3  |2.22
  • We will be generating reports with COUNT and AVG on rating_value tables. However, there will be always a JOIN with review table, and the filter would be review.nid, which is indexed.
  • When accessing reviews, application will almost every time use the review_id, which are either primary or foreign keys in all tables.
  • Sometimes we need to CAST the score field to nearest integer.

Admittedly I'm not an expert in databases, but I spent a lot of time normalizing the above, and I ended up with the above database after all three steps. I'm eager to learn, so if you have suggestions to above table structure, please don't hesitate to mention them. Note that all review and rating tables will contain some extra data that I didn't mention here. the rating table has a text field to allow users to write a text review. That's why rating values and rating are separated.

The reports we generate are never going to run on all rows. Just a small set of rows filtered by the nid (plus additional filters).

My questions are:

  • Is it common to spread data of a single user operation across multiple rows? I expect the rating_value table to grow almost 10 times (in record count) than the review table.

  • While (inner) JOINing tables, and calculating averages, can database engines filter down the eligible rows using the indexes first?

  • If you have any thoughts about the above schema, please don't hesitate to comment on.

Best Answer

Is it common to spread data of a single user operation across multiple rows?

Think of the "entities": User, Review, Rating. A "user" may go to the restaurant many times, and give a "review" each time. The "review" may give ratings for multiple things, such as 'food' and 'location'.

These are "Many-to-one" relationships, so Ratings has a review_id and Reviews has a user_id. ("Many-to-many" relationships need an extra table.)

Yes, database optimizer can usually use indexes as needed. We need to look at the specific SELECTs to say further.

KEY `nid` (`nid`), -- Redundant with the next; DROP this.
KEY `nid_uid` (`nid`,`uid`),

It looks like you did a shotgun approach to adding INDEXes. Do EXPLAIN SELECT ... on your various SELECTs -- This is to see which indexes are actually used. Then DROP the unused queries.

Every table should (must?) have a PRIMARY KEY. For Ratings, it should probably be PRIMARY KEY(review_id, type).

Your "reports" are likely to do JOINs and may end up COUNTing and SUMming things extra times, so beware. Do a sanity check on the results. The overages can be fixed, but I'll let you get to that point before trying to explain the problem.