Mysql – ny benefit to this additional table

MySQLperformance

Sorry for the vague question title, I'm not very experienced with db administration

Essentially I have a web application that allows polls to be created with any amount of options, then users can answer a poll and select their options and assign a priority to each answer (first choice, second choice, etc)

So there's a poll table, poll_options table, and a votes table. For each answer a user selects, a new row is added to the votes table

+------+---------+----------------+---------+----------+
| id   | poll_id | poll_option_id | user_id | priority |
+------+---------+----------------+---------+----------+
| 1    | 7       | 4              | 16      | 1        | 
| 2    | 7       | 1              | 20      | 1        | 
| 3    | 7       | 3              | 20      | 2        | 
| 4    | 8       | 1              | 16      | 1        | 
| 5    | 8       | 2              | 16      | 2        | 
| 6    | 8       | 3              | 16      | 3        | 
| 7    | 8       | 4              | 16      | 4        | 
| 8    | 8       | 3              | 2       | 1        | 
| 9    | 8       | 1              | 2       | 2        | 
+------+---------+----------------+---------+----------+

Now, assuming that table is acceptable, what comes to mind is that if there are 10 polls with 10 options each and 10 users, and they all vote for each thing on every poll… that's 1000 rows already – if I need to check if a user has voted in a poll already, I have to look up the current poll ID and user ID

SELECT COUNT(*) FROM votes WHERE user_id = 16 AND poll_id = 7
1

Is it bad design that it has to go through thousands of rows to find that? Would it be better to have a smaller table between the polls table and the votes, for checking if a user has voted in a poll, all the polls the user has voted in, etc.? Or would this have no performance benefit over my current design?

Would look something like this I guess

+------+---------+---------+
| id   | poll_id | user_id |
+------+---------+---------+
| 1    | 7       | 16      |
| 2    | 7       | 20      |
| 3    | 8       | 16      |
| 4    | 8       | 2       |

Then I'd add the id from this new vote_item table to the votes table as a FK (?) so I can grab the vote_item id via the user_id and poll_id, and return all the rows in the votes table with that vote_item_id

Best Answer

If I need to check if a user has voted in a poll already, I have to look up the current poll ID and user ID

   SELECT COUNT(*) FROM votes WHERE user_id = 16 AND poll_id = 7 ;

This would be efficient with an index on (user_id, poll_id). But if you don't need the count but just whether a user has taken a poll, you only need an EXISTS subquery. Either SELECT EXISTS (...) or WHERE EXISTS (...), depending on what you want to do with this check:

... EXISTS (SELECT * FROM votes WHERE user_id = 16 AND poll_id = 7)

It will be a bit more efficient than the COUNT() query - assuming that the above index has been added.

Is it bad design that it has to go through thousands of rows to find that?

No, if you have the index, it won't go through thousands or millions of rows. It will do a single index seek.

Would it be better to have a smaller table between the polls table and the votes, for checking if a user has voted in a poll, all the polls the user has voted in, etc.?

It might be better, yes. The smaller (in number of rows) table means that the indexes will be smaller, too. So you would use an EXISTS subquery on a smaller index. For the specific query, the difference in efficiency would be very small though. For different queries, say "How nay users have taken this poll?" or *"How many users have taken each poll?", you'd get larger benefit as they require a scan of the whole index.

Or would this have no performance benefit over my current design?

So, it depends on what kind of queries you have.

Would look something like this I guess

    +------+---------+---------+
    | id   | poll_id | user_id |
    +------+---------+---------+
    | 1    | 7       | 16      |
    | 2    | 7       | 20      |
    | 3    | 8       | 16      |
    | 4    | 8       | 2       |
    

Then I'd add the id from this new vote_item table to the votes table as a FK (?) so I can grab the vote_item id via the user_id and poll_id, and return all the rows in the votes table with that vote_item_id.

Not exactly. The table only needs user_id and poll_id and a UNIQUE constraint on (poll_id, user_id). The id is useless for this many-to-many table. In most many-to-many tables, it's common to have two unique indexes, on (a,b) and (b,a). So, I suggest you have that (poll_id, user_id) as the primary key and a unique index on (user_id, poll_id) in the new poll_users table, if you decide to add this table.

You are right about the FOREIGN KEY though. You would add a foreign key from votes (poll_id, user_id) that REFERENCES poll_users (poll_id, user_id) and remove the individual foreign keys from votes to polls (poll_id) and users (user_id).

By the way the id in the votes looks useless, too. I'd have a UNIQUE constraint on (poll_id, user_id, poll_option_id) (meaning: no user can answer the same poll option in a poll twice) and throw away that id.