MySQL – Any benefit to this additional table

MySQL, performance

Sorry for the vague question title, I'm not very experienced with db administration

Essentially I have a web application that allows polls to be created with any number of options. Users can then answer a poll by selecting options and assigning a priority to each answer (first choice, second choice, etc.)

So there's a poll table, poll_options table, and a votes table. For each answer a user selects, a new row is added to the votes table

+------+---------+----------------+---------+----------+
| id   | poll_id | poll_option_id | user_id | priority |
+------+---------+----------------+---------+----------+
| 1    | 7       | 4              | 16      | 1        | 
| 2    | 7       | 1              | 20      | 1        | 
| 3    | 7       | 3              | 20      | 2        | 
| 4    | 8       | 1              | 16      | 1        | 
| 5    | 8       | 2              | 16      | 2        | 
| 6    | 8       | 3              | 16      | 3        | 
| 7    | 8       | 4              | 16      | 4        | 
| 8    | 8       | 3              | 2       | 1        | 
| 9    | 8       | 1              | 2       | 2        | 
+------+---------+----------------+---------+----------+

Now, assuming that table is acceptable, what comes to mind is that if there are 10 polls with 10 options each and 10 users, and they all vote for each thing on every poll… that's 1000 rows already – if I need to check if a user has voted in a poll already, I have to look up the current poll ID and user ID

SELECT COUNT(*) FROM votes WHERE user_id = 16 AND poll_id = 7
1

Is it bad design that it has to go through thousands of rows to find that? Would it be better to have a smaller table between the polls table and the votes, for checking if a user has voted in a poll, all the polls the user has voted in, etc.? Or would this have no performance benefit over my current design?

Would look something like this I guess

+------+---------+---------+
| id   | poll_id | user_id |
+------+---------+---------+
| 1    | 7       | 16      |
| 2    | 7       | 20      |
| 3    | 8       | 16      |
| 4    | 8       | 2       |
+------+---------+---------+

Then I'd add the id from this new vote_item table to the votes table as a FK (?) so I can grab the vote_item id via the user_id and poll_id, and return all the rows in the votes table with that vote_item_id

Best Answer

If I need to check if a user has voted in a poll already, I have to look up the current poll ID and user ID
   SELECT COUNT(*) FROM votes WHERE user_id = 16 AND poll_id = 7 ;

This would be efficient with an index on (user_id, poll_id). But if you don't need the count but just whether a user has taken a poll, you only need an EXISTS subquery. Either SELECT EXISTS (...) or WHERE EXISTS (...), depending on what you want to do with this check:

... EXISTS (SELECT * FROM votes WHERE user_id = 16 AND poll_id = 7)

It will be a bit more efficient than the COUNT(*) query, assuming that the above index has been added.
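A minimal sketch of that index and check, assuming the votes table from the question. The index name idx_votes_user_poll and the has_voted alias are made up for illustration:

```sql
-- Composite index covering both columns used in the lookup
-- (the index name is hypothetical).
CREATE INDEX idx_votes_user_poll ON votes (user_id, poll_id);

-- Returns 1 if user 16 has at least one vote in poll 7, otherwise 0.
SELECT EXISTS (
    SELECT 1
    FROM votes
    WHERE user_id = 16 AND poll_id = 7
) AS has_voted;
```

Running EXPLAIN on the inner SELECT is one way to confirm that the index is actually used.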

Is it bad design that it has to go through thousands of rows to find that?

No, if you have the index, it won't go through thousands or millions of rows. It will do a single index seek.

Would it be better to have a smaller table between the polls table and the votes, for checking if a user has voted in a poll, all the polls the user has voted in, etc.?

It might be better, yes. The smaller (in number of rows) table means that the indexes will be smaller, too. So you would use an EXISTS subquery on a smaller index. For the specific query, the difference in efficiency would be very small though. For different queries, say "How many users have taken this poll?" or "How many users have taken each poll?", you'd get a larger benefit, as they require a scan of the whole index.
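As a rough illustration of the "How many users have taken each poll?" case, the two queries might look like this. The second one assumes the smaller table exists with one row per user per poll (it is called poll_users further down); the voters alias is made up:

```sql
-- Current design: a user has several rows per poll in votes,
-- so distinct users have to be counted.
SELECT poll_id, COUNT(DISTINCT user_id) AS voters
FROM votes
GROUP BY poll_id;

-- With the smaller table (one row per user per poll),
-- a plain row count is enough.
SELECT poll_id, COUNT(*) AS voters
FROM poll_users
GROUP BY poll_id;
```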

Or would this have no performance benefit over my current design?

So, it depends on what kind of queries you have.

Would look something like this I guess

    +------+---------+---------+
    | id   | poll_id | user_id |
    +------+---------+---------+
    | 1    | 7       | 16      |
    | 2    | 7       | 20      |
    | 3    | 8       | 16      |
    | 4    | 8       | 2       |
    +------+---------+---------+

Then I'd add the id from this new vote_item table to the votes table as a FK (?) so I can grab the vote_item id via the user_id and poll_id, and return all the rows in the votes table with that vote_item_id.

Not exactly. The table only needs user_id and poll_id and a UNIQUE constraint on (poll_id, user_id). The id is useless for this many-to-many table. In most many-to-many tables, it's common to have two unique indexes, on (a,b) and (b,a). So, I suggest you have that (poll_id, user_id) as the primary key and a unique index on (user_id, poll_id) in the new poll_users table, if you decide to add this table.

You are right about the FOREIGN KEY though. You would add a foreign key from votes (poll_id, user_id) that REFERENCES poll_users (poll_id, user_id) and remove the individual foreign keys from votes to polls (poll_id) and users (user_id).

By the way, the id in the votes table looks useless, too. I'd have a UNIQUE constraint on (poll_id, user_id, poll_option_id) (meaning: no user can answer the same poll option in a poll twice) and throw away that id.
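Put together, the suggested layout could be sketched roughly as below. This is only an illustration: the referenced tables polls, users and poll_options and their id columns, the INT column types, and the key name uq_poll_users_user_poll are all assumptions, since the question does not show those definitions. The sketch also uses the three-column unique key as the primary key of votes, which is one way of "throwing away" the id.

```sql
-- One row per (poll, user): records that the user has taken the poll.
CREATE TABLE poll_users (
    poll_id INT NOT NULL,
    user_id INT NOT NULL,
    PRIMARY KEY (poll_id, user_id),
    UNIQUE KEY uq_poll_users_user_poll (user_id, poll_id),  -- hypothetical name
    FOREIGN KEY (poll_id) REFERENCES polls (id),            -- assumed parent key
    FOREIGN KEY (user_id) REFERENCES users (id)             -- assumed parent key
);

-- One row per option a user selected in a poll.
CREATE TABLE votes (
    poll_id        INT NOT NULL,
    user_id        INT NOT NULL,
    poll_option_id INT NOT NULL,
    priority       INT NOT NULL,
    -- No user can pick the same option twice in the same poll.
    PRIMARY KEY (poll_id, user_id, poll_option_id),
    -- Replaces the separate foreign keys to polls and users.
    FOREIGN KEY (poll_id, user_id) REFERENCES poll_users (poll_id, user_id),
    FOREIGN KEY (poll_option_id) REFERENCES poll_options (id)  -- assumed parent key
);
```

The column types in the foreign keys have to match the types of the referenced columns, so INT here would need adjusting to whatever the existing tables use.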

Related Solutions

Mysql – show results from 3 tables excluding NULL or EMPTY values for certain fields

Try this:

SELECT tbl_polls.id,
  tbl_polls.title,
  tmp_tbl.pollid,
  COUNT(tmp_tbl.pollid) AS clean_count,  -- distinct voters per poll
  GROUP_CONCAT(DISTINCT u.name) AS user
FROM
(
  -- collapse repeated votes by the same user in the same poll
  SELECT DISTINCT user_id, pollid
  FROM tbl_votes
) AS tmp_tbl
JOIN tbl_polls
  ON tbl_polls.id = tmp_tbl.pollid
JOIN tbl_users AS u
  ON tmp_tbl.user_id = u.user_id
WHERE tmp_tbl.user_id IS NOT NULL
GROUP BY tbl_polls.id, tbl_polls.title, tmp_tbl.pollid
ORDER BY clean_count DESC

This is the sample data that I used:

CREATE TABLE tbl_polls (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(50)
  );


CREATE TABLE tbl_users (
  user_id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)
  );

CREATE TABLE tbl_votes (
  id INT AUTO_INCREMENT PRIMARY KEY,
  pollid INT,
  user_id INT,
  INDEX (pollid,user_id),
  FOREIGN KEY (pollid) REFERENCES tbl_polls(id),
  FOREIGN KEY (user_id) REFERENCES tbl_users(user_id)
  );

INSERT INTO tbl_polls (title)
VALUES ('Title 1'),('Title 2'),('Title 3');


INSERT INTO tbl_users (name)
VALUES ('Bob'),('Jack'),('Joe'),('Tom');

INSERT INTO tbl_votes (pollid,user_id)
VALUES (1,2),(1,2),(1,2),(1,3),(2,4),(2,1),(2,4),(2,4),(3,1),(2,1),(1,4);

SQL FIDDLE DEMO

I have used GROUP_CONCAT to get the names for your list. Also, to speed up your database, have you tried adding indexes to the foreign key columns?

By the way, I don't understand your comment that some votes lack pollid and user_id. If that is the case, have a look at this.

Mysql – Database design for voting module with long-run and high-load capability

Is there really a limit to that design and if yes, how can it be dealt with?

1.1. If the select query on votes table will be getting slower, what can I do to speed it up?

I don't think the number of votes is likely to be the problem. Performance will depend in part on how well you can index, how your database does caching, and so on. Standard performance tuning applies, and that isn't really a matter of your design per se. I say more below about what to consider if you hit the wall of being unable to get your design to work fast enough.

Is there a better way to design this kind of relations?

Not really.

How do I cache that data? Or is that even needed with proper indexing?

My preference in this case would be to start out without caching, and then to implement a caching layer when you need one. A caching layer might include something like memcached, or you could build one on a NoSQL solution like Mongo. At that point you can look at optimizing the areas which are the largest problems.

What kind of indexes would you recommend for the votes table? Am I correct that I need a simple double-field index (user_id, content_id)?

I know that MySQL and PostgreSQL are different enough to make cross-database advice somewhat dangerous here, but I think you'd want two indexes, one on content_id and one on user_id. Aggregating by user_id and aggregating by content_id are likely to be different queries, and these are different join conditions.
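In MySQL syntax, that suggestion might look like the following sketch. It assumes the table is called votes with columns content_id and user_id, as in the question; the index names are made up:

```sql
-- Separate single-column indexes, one per typical lookup/aggregation
-- (index names are hypothetical).
CREATE INDEX idx_votes_content ON votes (content_id);
CREATE INDEX idx_votes_user ON votes (user_id);
```

Whether these serve the workload better than a single composite index on (user_id, content_id) depends on the actual queries, so it is worth checking with EXPLAIN.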

Most of the load will go on recent content pieces, maybe I should create something like recent_votes table, which will hold duplicate data, but only for the last say 24 hours and most load will go on it, and if user wants some data that is older, he will work with much bigger and slower table with all votes? Does that make any sense?

Keep in mind that databases frequently do a good job of caching recent content pieces. I would expect that MySQL can do this too. If it can't, go with PostgreSQL instead. Don't cache it yourself in the database.

What to do if you hit the wall will depend on your database choice. If you are using MySQL, the traditional answer is to look at something like memcached or to create a caching layer in a NoSQL database. If you are using PostgreSQL, you get those choices plus something like Postgres-XC, which gives you the ability to do Teradata-style scaling out and clustering in OLTP environments.
