Modeling a database for easy counting / reporting

database-design

I have an app where user is known (user_id) and he can do several actions (action_id).
Every time he makes an action I need to save the fact that he made it for reports/analytics. I guess it is similar to other analytic solutions and their db design.

Once I have the data, provided with a time window (minutes resolution) I need to count for each user (all or some) the number of times he did actions and which actions he did. (sum all data grouped by action_id).

Some assumptions:

  • The number of users are ~1000.
  • Action types are ~100.
  • Actions can happen 24/7.
  • The time windows can span from minutes to days and are random.
  • A time window can't go back more than 30 days.

I'm considering SQL, NoSQL and RRD to save the data.

I put RRD here because it's easy to implement the insert of the data into statds+graphite. I'm concerned if I take this approach, the querying (although provided by graphite) will not be indexed and will probably have to count all the data whenever I ask for a window/user (no indexing). Another problem is that when querying all the data, all users info will be needed, resulting in reading all the files concurrently which I'm not sure is a good thing.

SQL – Very easy implementation when inserting the data and querying. Easy to index, order and group by. However I'm not sure it's easy if I'm anticipating high traffic. Also, I'm not sure how effective is the count() of sql (haven't used SQL in the last few years) after group by. Can it offer parallel computation?

NoSQL – Is there a solution out there that is the right fit for this type of scenario (perhaps a Map/Reduce algorithm to fast generation of counts in a time window?)

Thanks for helping me model

Best Answer

I think the essential approach should be to keep it simple. It will save you frustration and unplanned maintenance. The table design is simple a parent (users) and a child (actions) with a one-to-many foreign key relationship. The data you need will be easy to query.

Lets assume active user actions are 500 users 20 actions 100 per day for 30 days. That's about 30 million rows, a regular day's work for a database.

It sounds like you are most comfortable with a standard SQL relational database. That approach is also widely used, with good community and tool support, and easily maintainable.

So you have a simple design, a feasible architecture, and good comfort level with a SQL relational database. Go with that.

Related Question