Mysql – Database schema – performance v correctness information

database-designMySQLschema

I have a database that stores information about the users of my website (username, password, etc). I also store information about their activity on the site.

For example, I have a large number of questions on my site. I have a few topics, and each topic has a number of questions.

Whenever a user answers a question, I store the result in a table containing the tuples: (user_id, question_id, timesAnswered, timesCorrect)

I also have a table called topics_and_questions containing (question_id(PK), topic_id(PK)) which can be used to determine which topic(s) the question belongs to.

Based on the above, it is possible to determine how many questions a user has answered on a particular topic. However, this would involve searching through the first table for all question_ids answered by the user, then searching the topics_and_questions table. (The first table could get rather large)

Does it make sense to instead create another table users_and_topics which contains (user_ID, topic_id, questionsAnswered) and update this whenever a question is answered? It would seem as though this would be bad database practise as data would essentially be replicated (this table can be derived from current tables). However, it could provide performance benefits as it would mean less work for the server. I would expect around 1 request for this information every 5 minutes per user of my website.

I'm using innoDB, MySQL database if it matters

Best Answer

I concur with your speculation that it's bad design practice. I'd recommend against it. Any time you do this sort of thing, it almost inevitably comes back to bite and the performance difference compared to a properly designed schema should be small.

You say that as it stands now, your process "would involve searching through the first table for all question_ids answered by the user, then searching the topics_and_questions table" but that's not really a good analysis of what should happen behind the scenes.

Properly designed, this process would involve checking an index on the first table for all questions answered by the user, joining to the second table to find the topics of the questions (and possibly a third table to find the names of the topics), but all in one query and with no full table scans.

If your first table (no table name provided) has (user_id, question_id) as its primary key (as it probably should), then finding the questions answered by each user will be a fast operation, no matter how many records are in this table. Rows in InnoDB are stored in primary key order, so there's no "searching though" the entire table that needs to happen... the storage engine can go directly to where that user's records are without a full table scan. If (user_id,question_id) isn't the primary key, then adding an index on (user_id,question_id) will optimize this part of the query.

In topics_and_questions either (question_id) or (question_id,topic_id) should probably be the primary key, depending on whether a question can be in multiple topics. If it can, of course, then counting questions answered becomes more complicated.

Since you appear to be allowing a user to answer the same question more than once, there may be yet another issue to consider: you actually have to duplicate part of what you're trying to avoid, every time you update this new proposed table: You have to check whether the question just answered was one that's been answered before, or not, and only increment this counter, if it is, I would assume.