These are general recommendations, since you have not shown the full extent of the queries you plan to perform (that is, what kind of analytics you intend to do).
Assuming you do not need real-time results, you should denormalize your data at the end of each period, precalculate your aggregated results once for all the timeframes you need (by day, by week, by month), and work only with summary tables. Depending on the queries you intend to run, you may not even need to keep the original data.
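As a minimal sketch of the precalculation step (the events table and its columns here are hypothetical, since your actual schema isn't shown):

```sql
-- Hypothetical raw table: events(user_id, event_type, created_at).
-- Daily summary table, recalculated once at the end of each day.
CREATE TABLE events_daily (
    day        DATE        NOT NULL,
    user_id    INT         NOT NULL,
    event_type VARCHAR(20) NOT NULL,
    cnt        BIGINT      NOT NULL,
    PRIMARY KEY (day, user_id, event_type)
);

-- Run once per period (e.g., from a nightly cron job):
INSERT INTO events_daily (day, user_id, event_type, cnt)
SELECT DATE(created_at), user_id, event_type, COUNT(*)
FROM events
WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND created_at <  CURRENT_DATE
GROUP BY DATE(created_at), user_id, event_type;
```

Weekly and monthly tables can then be rolled up from the daily one instead of rescanning the raw data.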
If durability is not a problem (you can always recalculate the statistics, since the raw data lives elsewhere), you can use a caching mechanism (an external one, or the memcached plugin included with MySQL 5.6), which works great for writing and reading key-value data in memory.
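If you go the MySQL 5.6 route, enabling the built-in memcached interface is a one-time setup, roughly as follows (the script path varies by installation, so treat this as a sketch):

```sql
-- Load the innodb_memcache configuration schema shipped with MySQL 5.6
-- (the script location depends on your distribution):
SOURCE /usr/share/mysql/innodb_memcached_config.sql;

-- Start the memcached daemon plugin inside mysqld:
INSTALL PLUGIN daemon_memcached SONAME "libmemcached.so";
```

After that, clients can read and write the mapped InnoDB rows over the standard memcached protocol (port 11211 by default).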
Use partitioning (it can also be done manually); with this kind of application, the most frequently accessed rows are usually also the most recent. Delete or archive old rows to other tables to use your memory efficiently.
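A sketch of what that could look like, again assuming a hypothetical events table, partitioned by month:

```sql
CREATE TABLE events (
    user_id    INT         NOT NULL,
    event_type VARCHAR(20) NOT NULL,
    created_at DATETIME    NOT NULL
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
    PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Archiving a whole month is then a cheap metadata operation,
-- far faster than DELETEing rows one by one:
ALTER TABLE events DROP PARTITION p2013_01;
```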
Use InnoDB if you want durability and highly concurrent writes, and if your most frequently accessed data will fit into memory. There is also TokuDB: it may not be faster in raw terms, but it scales better for insertions into huge, tall tables and supports on-disk compression. There are also analytics-focused engines such as Infobright.
Edit:
23 insertions/second is feasible in any storage engine, even with a bad disk, but:
You do not want to use MyISAM: it cannot do concurrent writes (except under very specific conditions), and you do not want huge tables that can become corrupted and lose data.
InnoDB is fully durable by default; for better performance you may want to reduce the durability guarantees or use a good storage backend (disk caches). InnoDB tends to get slower at insertion as tables grow huge, where "huge" means "the upper parts of the primary key and other unique indexes no longer fit into the buffer pool" (they are needed to check for uniqueness). Where that threshold lies depends on the memory available. If you want scalability beyond that, you have to partition (as I suggested above) or shard, or use one of the alternative engines I mentioned before (TokuDB).
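For example, the main durability knob in InnoDB is innodb_flush_log_at_trx_commit; relaxing it trades at most about a second of committed transactions for much cheaper commits:

```sql
-- 1 (default): flush the redo log to disk at every commit; fully durable.
-- 2: write at commit, flush once per second; may lose ~1s of data on an OS crash.
-- 0: write and flush once per second; may lose ~1s even on a mysqld crash.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Batching many inserts into one transaction also amortizes the commit cost:
START TRANSACTION;
-- ... many INSERT statements ...
COMMIT;
```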
SUM() statistics do not scale on the standard MySQL engines. An index increases performance, again because most of the operations can then be done in memory, but one index entry per row still has to be read, in a single thread. I mentioned design alternatives (summary tables, caching) and alternative engines (column-based) as solutions to that. But if you do not need real-time results, only report-like queries, you shouldn't worry too much about it.
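To illustrate the single-threaded index scan point (table and column names are made up for the example):

```sql
-- A covering index lets MySQL answer the SUM() from the index alone,
-- without touching the table rows, but it still reads one index
-- entry per matching row, in a single thread.
ALTER TABLE events ADD INDEX idx_user_day_value (user_id, created_at, value);

SELECT user_id, SUM(value)
FROM events
WHERE user_id = 42
  AND created_at >= '2013-01-01'
GROUP BY user_id;
```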
I suggest you do a quick load test with fake data. I've had many clients doing analytics on MySQL for social networks without problems (well, at least after I helped them :-) ), but your decision may depend on your actual non-functional requirements.
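One cheap way to run such a test, sketched with the same hypothetical table: insert a seed row, then repeatedly self-insert to double the row count, and time your real queries against the result.

```sql
INSERT INTO events (user_id, event_type, created_at)
VALUES (1, 'follow', NOW());

-- Each run doubles the table; ~20 runs gives about a million rows.
INSERT INTO events (user_id, event_type, created_at)
SELECT FLOOR(RAND() * 100000), event_type,
       NOW() - INTERVAL FLOOR(RAND() * 365) DAY
FROM events;
```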
With the given information, I find it difficult to give you a simple answer. As you have discovered, data can be hierarchical with many different logical groupings that could work.
Here is a checklist of things you can do to get a better answer for yourself:
- What is the full set of queries you will want to run? (Among your schema options, which one(s) minimize the number of trips to disk needed to retrieve data for that set of queries?)
- How will your data set scale? (Among your schema choices, which one(s) scale gracefully, minimizing the number of fat nodes or other degenerate cases that will not scale well?)
- What kind of churn do you expect? How much are you willing to slow writes in order to accelerate reads?
I hope this helps.
Best Answer
On the surface this sounds like a graph database problem. If you're going to be walking the edges between users, Neo4j or the like may be the one for you.
You might be able to do more generic processing using a document DB where every user document has an _id of the user's ID and an array of follower _ids.
Perhaps you could output to MongoDB, then use Neo4j for building the graph(s) for specialised work and MongoDB for more general work. MapReduce and the aggregation framework in MongoDB are pretty good (speaking from experience), although MapReduce is (currently) much more powerful than the aggregation framework.
Since the schema is likely to morph, and you do not know what the additional data will be, you might prefer a document or graph DB over an RDBMS. If you prefer to work in a relational manner at a later point, you can generate CSV extracts to load into your RDBMS of choice once you have defined a schema.