InnoDB and MyISAM each have their strengths and weaknesses.
If you have enough RAM, I would choose InnoDB because it caches data and index pages in the Buffer Pool. MyISAM only caches index pages in the Key Cache.
MyISAM takes a full table lock for every INSERT, UPDATE, and DELETE, and it always requires disk access for data because the Key Cache holds index blocks only.
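As a quick way to see how much memory each engine's cache actually has on a given server, you can check the relevant variables; a minimal sketch, assuming a MySQL 5.x session:

    -- InnoDB caches data AND index pages here
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    -- MyISAM's Key Cache holds index blocks only
    SHOW VARIABLES LIKE 'key_buffer_size';
    -- How much of the Buffer Pool is in use right now
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';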
InnoDB tables always incur disk I/O in the following areas:
- Double Write Buffer: changes are posted in ibdata1 to avoid depending on OS caching
- Insert Buffer: changes to secondary (non-unique) indexes are posted in ibdata1
- Data and Indexes
  - With innodb_file_per_table = 0, changes are written to ibdata1
  - With innodb_file_per_table = 1, changes are written to the table's .ibd tablespace file; read I/O against ibdata1 is still necessary to cross-check table metadata (see the sketch after this list)
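Here is a minimal sketch of checking and enabling the per-table tablespace setting (mytable is a placeholder name; tables created before the change stay in ibdata1 until they are rebuilt):

    -- Check the current setting
    SHOW VARIABLES LIKE 'innodb_file_per_table';
    -- Enable it for tables created from now on (can also be set in my.cnf)
    SET GLOBAL innodb_file_per_table = 1;
    -- Rebuild an existing table to move it into its own .ibd file
    ALTER TABLE mytable ENGINE=InnoDB;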
SUMMARY
In an environment with the following:
- heavy writes
- heavy reads
- tons of RAM
- a high number of concurrent connections
I would always choose InnoDB. Please check out my other post about InnoDB over MyISAM: When to switch from MyISAM to InnoDB?
When would I ever choose MyISAM?
Under the following scenario:
- Using MySQL Replication
- Master with all InnoDB
- Slave with all tables converted to MyISAM
- ALTER TABLE ... ROW_FORMAT=FIXED applied to all tables on the Slave (see the sketch after this list)
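A minimal sketch of that last step, run on the Slave for each table (mytable is a placeholder name):

    -- Convert to MyISAM with fixed-length rows; VARCHARs are stored like CHARs
    ALTER TABLE mytable ENGINE=MyISAM ROW_FORMAT=FIXED;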
Disk-I/O-wise, MyISAM has a slight edge with ROW_FORMAT=FIXED because you only interact with one file, the .MYD file. The row size is completely predictable because every VARCHAR is treated as a CHAR, thus shortening access time for data retrieval.
On the other hand, InnoDB has to interact with multiple files (ibdata1, plus several read/write threads working against the .ibd file of the InnoDB table).
Is there really a limit to that design, and if so, how can it be dealt with? 1.1. If the SELECT query on the votes table gets slower, what can I do to speed it up?
I don't think the number of votes is likely to be the problem. The answer will have to do in part with how well you can index, how your db does caching, etc. Standard performance tuning applies, and that isn't really about your design per se. I will say more below on what to consider if you hit the wall of being unable to get your design to work fast enough.
Is there a better way to design this kind of relation?
Not really.
How do I cache that data? Or is that even needed with proper indexing?
My preference in this case would be to start out without caching, and then to implement a caching layer when you need one. A caching layer might include something like memcached, or you could build one on a NoSQL solution like Mongo. At that point you can look at optimizing the areas which are the largest problems.
What kind of indexes would you recommend for the votes table? Am I correct that I need a simple two-column index on (user_id, content_id)?
I know that MySQL and PostgreSQL are different enough to make cross-db advice somewhat dangerous here, but I am thinking you'd want two separate indexes, one on content_id and one on user_id, because aggregating by user_id and aggregating by content_id are likely to be different queries with different join conditions.
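A minimal sketch, assuming the votes table has user_id and content_id columns as described in the question (the index names are made up):

    -- Two separate single-column indexes, one per likely aggregation/join path
    CREATE INDEX idx_votes_content_id ON votes (content_id);
    CREATE INDEX idx_votes_user_id ON votes (user_id);

If each user may only vote once per content piece, a UNIQUE index on (user_id, content_id) would also cover lookups by user_id on its own, since user_id is the leftmost column.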
Most of the load will go to recent content pieces. Maybe I should create something like a recent_votes table that holds duplicate data, but only for, say, the last 24 hours; most of the load would go to it, and if a user wants older data, they would work with the much bigger and slower table holding all votes. Does that make any sense?
Keep in mind that databases frequently do a good job of caching recently accessed data on their own. I would expect MySQL to do this too; if it can't, go with PostgreSQL instead. Don't cache it yourself inside the db.
What to do if you hit the wall will depend on your DB choice. If you are using MySQL, your traditional answer is to look at something like memcached or to create a caching layer in a NoSQL db. If you are using PostgreSQL, you get those choices plus something like Postgres-XC, which gives you the ability to do Teradata-style scaling out and clustering in OLTP environments.
With scalability, part of the problem is that this is pretty heavily use-case dependent. There is no scaling solution that hits every use case equally. For example, Slony is very helpful for scaling out in some cases, but is a bear in others: it lets you replicate only part of your database, which is a real asset when that is what you need and added complexity when it is not. Overall, MySQL seems a little easier to scale out of the box in the workloads it supports, but PostgreSQL seems to offer more advanced tools which you can take quite a bit further.
As far as which is more scalable in common use cases, now that Postgres-XC has been released, PostgreSQL is the clear winner here. We have Slony, Bucardo, Streaming Replication, PGPool, etc. on the replication side. We have Postgres-XC on the scale out clustering side. We are extremely scalable now.
However, all these come with complexity costs. These costs appear a bit higher on the Pg side, but the flexibility is there to scale however you need to.
Scaling up right now is also a major PostgreSQL goal. 9.2 makes some important gains in that regard.