MySQL – Database design for a voting module with long-term, high-load capability

MySQL

I'm currently designing a project and need professional advice from a DBA.

My project will feature a voting system somewhat similar to the one used on Stack Exchange sites. I have users and content pieces, and users can vote content pieces up or down. Note that the up/down buttons will appear in the feed list, so if I load 30 content pieces I also need to load the current user's votes on every piece, since the up/down button should be highlighted if the user has already voted on a specific piece. In other words, I expect heavy load on the votes table. I was thinking of a basic structure like this:

A users table (user_id, ...), a content table (content_id, ...), and a votes table (vote_id, user_id, content_id, datetime, vote). However, I have doubts about this design.
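To spell that out, here is a minimal sketch of what I mean (types, engines and the extra columns are placeholders; I've renamed the datetime column to created_at for readability, and I'm assuming one vote per user per content piece, hence the unique key):

    -- Minimal sketch of the proposed structure; real tables would have more columns.
    CREATE TABLE users (
        user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
        -- ... profile columns ...
    );

    CREATE TABLE content (
        content_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
        -- ... content columns ...
    );

    CREATE TABLE votes (
        vote_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id    INT UNSIGNED    NOT NULL,
        content_id INT UNSIGNED    NOT NULL,
        created_at DATETIME        NOT NULL,
        vote       TINYINT         NOT NULL,  -- +1 for an upvote, -1 for a downvote
        UNIQUE KEY uq_votes_user_content (user_id, content_id),  -- one vote per user per piece
        FOREIGN KEY (user_id)    REFERENCES users (user_id),
        FOREIGN KEY (content_id) REFERENCES content (content_id)
    ) ENGINE=InnoDB;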

Let's say I have 10k users and 1k content pieces. That's up to 10 million records in the votes table. Once I start thinking about scaling up, I can see a big problem: content never goes away and neither do old votes, so the longer the website runs, the more records the table accumulates and the slower it will get.

Let's say that in a few years I have 100k users and 20k content pieces. That's up to 2 billion records. I understand that not every user will vote on every content piece, but the problem is still clear: there is a limit to this design (by limit I mean that select queries will become slow once the row count reaches some point).
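To make the load concrete, the query I worry about is the one behind every feed load; it would be something like this (a rough sketch only: the title column and the ordering are placeholders, and the current user's id gets bound by the application):

    -- Load 30 content pieces plus the current user's vote on each
    -- (my_vote is NULL where the user has not voted yet).
    SELECT c.content_id,
           c.title,                  -- placeholder content column
           v.vote AS my_vote
    FROM content AS c
    LEFT JOIN votes AS v
           ON  v.content_id = c.content_id
           AND v.user_id    = ?      -- current user's id, bound by the application
    ORDER BY c.content_id DESC       -- stand-in for "most recent first"
    LIMIT 30;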

So my questions are:

  1. Is there really a limit to this design, and if so, how can it be dealt with?
    1.1. If select queries on the votes table get slower over time, what can I do to speed them up?
  2. Is there a better way to design this kind of relation?
  3. How do I cache that data? Or is that even needed with proper indexing?
  4. What kind of indexes would you recommend for the votes table? Am I correct that I need a simple two-column index on (user_id, content_id)?
  5. Most of the load will hit recent content pieces. Maybe I should create something like a recent_votes table that holds duplicate data, but only for, say, the last 24 hours, so that most of the load goes to it, and if a user wants older data, they work with the much bigger and slower table of all votes? Does that make any sense?

I'd really like to do things right from the start so that in a few years I don't end up with a slow website. Thank you for your time.

Best Answer

Is there really a limit to this design, and if so, how can it be dealt with? 1.1. If select queries on the votes table get slower over time, what can I do to speed them up?

I don't think the number of votes is likely to be the problem. The answer will depend in part on how well you can index, how your database handles caching, and so on. Standard performance tuning applies, and that isn't really about your design per se. I'll say more below about what to consider if you hit the wall of being unable to get this design to work fast enough.
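As part of that tuning, get into the habit of looking at the plan of your hot queries. As a rough illustration, using something along the lines of the feed query sketched in the question, you want to see an index lookup on votes rather than a full table scan:

    -- Inspect the plan for the per-feed query; a full table scan on votes here
    -- would be the warning sign that indexing needs attention.
    EXPLAIN
    SELECT c.content_id, v.vote
    FROM content AS c
    LEFT JOIN votes AS v
           ON  v.content_id = c.content_id
           AND v.user_id    = 123    -- any existing user id
    ORDER BY c.content_id DESC
    LIMIT 30;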

Is there a better way to design this kind of relations?

Not really.

How do I cache that data? Or is that even needed with proper indexing?

My preference in this case would be to start out without caching, and then implement a caching layer when you need one. A caching layer might be something like memcached, or you could build one on a NoSQL solution like Mongo. At that point you can look at optimizing the areas that are the biggest problems.

What kind of indexes would you recommend for the votes table? Am I correct that I need a simple two-column index on (user_id, content_id)?

I know that MySQL and PostgreSQL are different enough to make cross-database advice somewhat risky here, but I think you'd want two indexes: one on content_id and one on user_id. The reason is that aggregating by content_id and looking up by user_id are likely to be different queries with different join conditions.
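In rough DDL terms, that would be something like the following (index names are arbitrary; and if you already enforce one vote per user per piece with a unique key on (user_id, content_id), its leading column covers lookups by user_id, making the separate user_id index redundant):

    -- One index per common access path (names are arbitrary):
    CREATE INDEX idx_votes_content ON votes (content_id);  -- tallying/loading votes per content piece
    CREATE INDEX idx_votes_user    ON votes (user_id);     -- listing a given user's votes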

Most of the load will hit recent content pieces. Maybe I should create something like a recent_votes table that holds duplicate data, but only for, say, the last 24 hours, so that most of the load goes to it, and if a user wants older data, they work with the much bigger and slower table of all votes? Does that make any sense?

Keep in mind that databases frequently do a good job of caching recently accessed data on their own. I would expect MySQL to manage this too; if it can't, go with PostgreSQL instead. Don't duplicate the data yourself inside the database.

What to do if you hit the wall will depend on your database choice. If you are using MySQL, the traditional answer is to look at something like memcached or to build a caching layer on a NoSQL database. If you are using PostgreSQL, you get those choices plus something like Postgres-XC, which gives you the ability to do Teradata-style scale-out and clustering in OLTP environments.