Mysql – Database schema – performance v correctness information

database-designMySQLschema

I have a database that stores information about the users of my website (username, password, etc). I also store information about their activity on the site.

For example, I have a large number of questions on my site. I have a few topics, and each topic has a number of questions.

Whenever a user answers a question, I store the result in a table containing the tuples: (user_id, question_id, timesAnswered, timesCorrect)

I also have a table called topics_and_questions containing (question_id(PK), topic_id(PK)) which can be used to determine which topic(s) the question belongs to.

Based on the above, it is possible to determine how many questions a user has answered on a particular topic. However, this would involve searching through the first table for all question_ids answered by the user, then searching the topics_and_questions table. (The first table could get rather large)

Does it make sense to instead create another table users_and_topics which contains (user_ID, topic_id, questionsAnswered) and update this whenever a question is answered? It would seem as though this would be bad database practise as data would essentially be replicated (this table can be derived from current tables). However, it could provide performance benefits as it would mean less work for the server. I would expect around 1 request for this information every 5 minutes per user of my website.

I'm using innoDB, MySQL database if it matters

Best Answer

I concur with your speculation that it's bad design practice. I'd recommend against it. Any time you do this sort of thing, it almost inevitably comes back to bite and the performance difference compared to a properly designed schema should be small.

You say that as it stands now, your process "would involve searching through the first table for all question_ids answered by the user, then searching the topics_and_questions table" but that's not really a good analysis of what should happen behind the scenes.

Properly designed, this process would involve checking an index on the first table for all questions answered by the user, joining to the second table to find the topics of the questions (and possibly a third table to find the names of the topics), but all in one query and with no full table scans.

If your first table (no table name provided) has (user_id, question_id) as its primary key (as it probably should), then finding the questions answered by each user will be a fast operation, no matter how many records are in this table. Rows in InnoDB are stored in primary key order, so there's no "searching though" the entire table that needs to happen... the storage engine can go directly to where that user's records are without a full table scan. If (user_id,question_id) isn't the primary key, then adding an index on (user_id,question_id) will optimize this part of the query.

In topics_and_questions either (question_id) or (question_id,topic_id) should probably be the primary key, depending on whether a question can be in multiple topics. If it can, of course, then counting questions answered becomes more complicated.

Since you appear to be allowing a user to answer the same question more than once, there may be yet another issue to consider: you actually have to duplicate part of what you're trying to avoid, every time you update this new proposed table: You have to check whether the question just answered was one that's been answered before, or not, and only increment this counter, if it is, I would assume.

Related Solutions

Designing server side database to store user-specific data

There are 2 scenarios to consider. The first is when you have the user take 1 or more questions without a meaningful context such an exam or formal quiz. In which case you solution works. However, the second scenario is about when the user takes a question as part of a formal session. In this case you have the following rules:

A user takes zero, one or more quizes
A quiz contains 1 or more questions
A user provides an answer for zero,1 or more specific question on a specific quiz

Note:

Table names may be different in the solution than from this diagram

A1 = Standard answer for question 1

UA1 = User provided answer for question 1

enter image description here

Mysql – Database design – many to many

Considering that the question is labeled database-design I will offer a solution without that CSV field. Even if you decide to keep it, this may offer some ideas.

Question contains all possible questions.
Answer has all possible answers (to any question)
QuestionAnswer contains all QA combinations offered, note that there must exist a default answer for "I do not know".
UserQA contains questions user answered.
Note that rule "There can be one and only one correct answer for each question" (IsCorrect = 'Y'), has to be implemented on the application level or as a stored procedure/trigger.

enter image description here

To get questions that a user with specific_UserId has not yet answered

select  qq.QuestionID
from Question as qq
where not exists (select 1 
                    from UserQA as xx 
                   where xx.QuestionID = qq.QuestionID
                     and xx.UserID     = specific_UserId );

Or, better if there may be some new questions which do not have defined answers yet

select distinct qq.QuestionID
from QuestionAnswer as qq
where not exists (select 1 
                    from UserQA as xx 
                   where xx.QuestionID = qq.QuestionID
                     and xx.UserID     = specific_UserId );

Best Answer

Related Solutions

Designing server side database to store user-specific data

Mysql – Database design – many to many

Related Question