MySQL – Database design suggestions for a data scraping/warehouse application

data-warehouse, database-design, database-recommendation, mysql, optimization

I'm looking into the database design for a data-warehouse-style project that involves a large number of inserts daily. The archived data will later be used to generate reports. I will have a list of users (for example, a set of 2 million users) whose daily social networking activity I need to monitor.

For example, let there be a set of 100 users, say U1, U2, …, U100.

I need to insert their daily status count into my database.

Suppose the total (cumulative) status count obtained for user U1 over the period June 30 – July 6 is as follows:

June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107

The database should keep the daily status count for each user; for example, for user U1:

July 1- 1 (100-99)
July 2- 2 (102-100) 
July 3- 0 (102-102) 
July 4- 3 (105-102) 
July 5- 0 (105-105) 
July 6- 2 (107-105) 

Similarly, the database should hold daily details for the full set of users.
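For illustration, here is a minimal sketch of how the cumulative counts could be stored and the daily deltas derived from them. The table and column names (status_snapshot, total_count, and so on) are hypothetical, not taken from the question.

    -- Hypothetical raw table: one row per user per day with the cumulative count
    CREATE TABLE status_snapshot (
        user_id      INT UNSIGNED NOT NULL,
        snapshot_day DATE         NOT NULL,
        total_count  INT          NOT NULL,
        PRIMARY KEY (user_id, snapshot_day)
    ) ENGINE=InnoDB;

    -- Daily delta for user U1: each day's total minus the previous day's total
    SELECT cur.snapshot_day,
           cur.total_count - prev.total_count AS daily_count
    FROM   status_snapshot AS cur
    JOIN   status_snapshot AS prev
           ON  prev.user_id      = cur.user_id
           AND prev.snapshot_day = cur.snapshot_day - INTERVAL 1 DAY
    WHERE  cur.user_id = 1;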

In a later phase I envision generating aggregate reports from this data, such as the total points scored each day, week, month, etc., and comparing them with older data.

I need to start from scratch. I am experienced with PHP as a server-side scripting language and with MySQL, but I am unsure about the database side. Since I need to process about a million insertions daily, what should be taken care of?

I am confused about how to design the MySQL database for this. Which storage engine should I use, and which design patterns should be followed, keeping in mind that the data could later be used effectively with aggregate functions?

Currently I envision one table storing all the user IDs, referenced by a foreign key, and a separate status count table for each day.

Does MySQL fit my requirement? Two million or more DB operations will be performed every day. What server and other considerations apply in this case?

EDIT:

Queries Involved:

INSERTION QUERIES

The insertion path should be able to handle 1-2 million inserts every day. (There are no updates.)

RETRIEVAL QUERIES

1. Sum of statuses for the whole set of users.

2. Sum of statuses for a set of users in a geographic location.

3. Comparing status counts across days/weeks/months (rough sketches of all three follow below).
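For illustration only, these retrieval queries might look roughly like the following, assuming a daily_status_count table (user_id, stat_day, daily_count) and a users table with a location column; all names are hypothetical.

    -- 1. Sum of statuses for the whole set of users on one day
    SELECT SUM(daily_count)
    FROM   daily_status_count
    WHERE  stat_day = '2013-07-06';

    -- 2. Sum of statuses for users in one geographic location over a date range
    SELECT SUM(d.daily_count)
    FROM   daily_status_count AS d
    JOIN   users AS u ON u.user_id = d.user_id
    WHERE  u.location = 'some-location'
    AND    d.stat_day BETWEEN '2013-07-01' AND '2013-07-07';

    -- 3. Totals grouped per week, for comparing week against week
    SELECT YEARWEEK(stat_day) AS yw, SUM(daily_count) AS weekly_total
    FROM   daily_status_count
    GROUP  BY YEARWEEK(stat_day);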

-> I believe some kind of index is needed in this case, but I have read that indexes can slow down insertion.

-> Also, I have heard that MyISAM would be a better choice than InnoDB in terms of speed.

Please advise.

Best Answer

These are general recommendations, since you do not show the full extent of the queries you plan to perform (what kind of analytics you intend to do).

Assuming you do not need real-time results, you should denormalize your data at the end of each period, precalculate your aggregated results once for all necessary timeframes (by day, by week, by month), and work only with summary tables. Depending on the queries you intend to run, you may not even need the original data.
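As a rough sketch of the summary-table idea, continuing the hypothetical daily_status_count example from the question edit (all names are assumptions, not the asker's actual schema):

    -- Hypothetical weekly summary table, rebuilt once at the end of each week
    CREATE TABLE weekly_summary (
        yw           INT          NOT NULL,   -- YEARWEEK() value, e.g. 201327
        user_id      INT UNSIGNED NOT NULL,
        weekly_total INT          NOT NULL,
        PRIMARY KEY (yw, user_id)
    ) ENGINE=InnoDB;

    -- Precalculate the aggregates once; reports then read only the summary table
    INSERT INTO weekly_summary (yw, user_id, weekly_total)
    SELECT YEARWEEK(stat_day), user_id, SUM(daily_count)
    FROM   daily_status_count
    WHERE  YEARWEEK(stat_day) = YEARWEEK(CURDATE() - INTERVAL 7 DAY)
    GROUP  BY YEARWEEK(stat_day), user_id;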

If durability is not a problem (you can always recalculate the statistics, since the raw data lives elsewhere), you can use a caching mechanism (external, or the memcached interface included in MySQL 5.6), which works great for writing and reading key-value data in memory.

Use partitioning (it can also be done manually): with this kind of application, the most frequently accessed rows are usually also the most recent. Delete or archive old rows into other tables to use your memory efficiently.
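A minimal partitioning sketch, again with hypothetical names (note that in MySQL the partitioning column must be part of every unique key, which is why stat_day is included in the primary key here):

    -- Hypothetical daily table partitioned by month; recent partitions stay hot,
    -- old ones can be archived or dropped cheaply
    CREATE TABLE daily_status_count (
        user_id     INT UNSIGNED NOT NULL,
        stat_day    DATE         NOT NULL,
        daily_count INT          NOT NULL,
        PRIMARY KEY (user_id, stat_day)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(stat_day)) (
        PARTITION p2013_06 VALUES LESS THAN (TO_DAYS('2013-07-01')),
        PARTITION p2013_07 VALUES LESS THAN (TO_DAYS('2013-08-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Removing an old month is a quick metadata operation, not a huge DELETE
    ALTER TABLE daily_status_count DROP PARTITION p2013_06;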

Use InnoDB if you want durability and highly concurrent writes, and if your most frequently accessed data will fit into memory. There is also TokuDB: it may not be faster in raw terms, but it scales better for insertions into huge, tall tables and allows on-disk compression. There are also analytics-focused engines such as Infobright.

Edit:

23 insertions/second is feasible on any storage engine, even with a bad disk, but:

  • You do not want to use MyISAM: it cannot do concurrent writes (except under very specific conditions), and you do not want huge tables that become corrupted and lose data.

  • InnoDB is fully durable by default; for better performance you may want to reduce the durability or have a good backend (disk caches); see the sketch after this list. InnoDB tends to get slower at insertion on huge tables. The definition of huge here is "the upper parts of the primary key and other unique indexes must fit into the buffer pool" in order to check for uniqueness, so it varies with the available memory. If you want scalability beyond that, you have to partition (as I suggested above) or shard, or use one of the alternative engines I mentioned before (TokuDB).
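Two small knobs that often help with a write-heavy load like this; innodb_flush_log_at_trx_commit is a standard InnoDB setting, while the table name is still the hypothetical one from above:

    -- Relax durability slightly: flush the redo log once per second instead of
    -- at every commit; at worst ~1 second of scraped data would need re-importing
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;

    -- Batch the daily load into multi-row inserts instead of one INSERT per row
    INSERT INTO daily_status_count (user_id, stat_day, daily_count) VALUES
        (1, '2013-07-06', 2),
        (2, '2013-07-06', 5),
        (3, '2013-07-06', 0);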

SUM() statistics do not scale on normal MySQL engines. An index increases performance because most of the operation can then be done in memory, but one entry per row still has to be read, in a single thread. I mentioned design alternatives (summary tables, caching) and alternative engines (column-based) as solutions to that. But if you do not need real-time results, only report-like queries, you shouldn't worry too much about it.
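If you do query the detail table directly, a covering index along these lines (hypothetical names again) keeps the SUM() scan inside the index, although it still reads one entry per row:

    -- Covering index for per-day SUM() queries on the detail table
    ALTER TABLE daily_status_count ADD INDEX idx_day_count (stat_day, daily_count);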

I suggest you do a quick load test with fake data. I've had many clients doing analytics of social networks on MySQL without problems (well, at least after I helped them :-) ), but your decision may depend on your actual non-functional requirements.
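One quick way to generate fake rows for such a load test could be a throwaway stored procedure like the following (entirely hypothetical; mysqlslap or a small PHP script would work just as well):

    -- Throwaway procedure that inserts n_rows random fake rows for load testing
    DELIMITER //
    CREATE PROCEDURE load_fake_data(IN n_rows INT)
    BEGIN
        DECLARE i INT DEFAULT 0;
        WHILE i < n_rows DO
            -- INSERT IGNORE skips the occasional duplicate (user_id, stat_day) pair
            INSERT IGNORE INTO daily_status_count (user_id, stat_day, daily_count)
            VALUES (FLOOR(1 + RAND() * 2000000), CURDATE(), FLOOR(RAND() * 10));
            SET i = i + 1;
        END WHILE;
    END//
    DELIMITER ;

    CALL load_fake_data(1000000);   -- roughly one day's worth of inserts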