Designing an aggregation schema

database-designschema

The only thing closest to what I'm developing is Facebook's activities log for users. So I figured if I could ask about how FB might possibly implement such a feature at the database layer, then it might give me some insight on how to solve my own similar problem.

How would one design a schema in a MySQL database to hold all the activities of a specific user. In FB's case, you have many activities that a user might be able to do, like something, comment on something, add a friend, use an app, etc. All these activities have their own database schema, I'm sure, so I'm assuming that the activities log schema would need to reference these other data somehow.

So what I'm asking is for ideas. The only ideas I can come up with is to create an activities table and multiple "connecting tables", tables that reference the activities table and the activity data (e.g. comments on something, likes something, etc.). However, this schema would require a new table for every new kind of activity and new business logic to handle these new tables. Maybe one can't get around this, but I'm interested in a schema that would require very little updating to the schema and business logic when new kinds of activities are added to the system.

Just to be clear, this question isn't intended to sound like I'm just simply trying to build another Facebook clone. I only use FB as a reference to something similar to what I'm trying to do. To be more abstract about my question: I'm trying to aggregate various kinds of data in my database into a single schema that could be queried, and anytime there is a new kind of data, there should little if any schema changes, otherwise what's the point? FB has two instances of this kind of aggregation, activity logs and notifications (that I can think of).

Any help is appreciated.

Best Answer

Relational DBMS is not ideally suited to this type of application. RDBMS is great for tracking and enforcing referential integrity for relations between two entities, THING A and THING B. What you want to do is to track relationships between THING A and every other thing in your database (or lots of other things, at least).

Your options, each of which include some form of compromise, include:

  • Sparse Foreign Key Columns: Create an activity table that has mutually exclusive foreign key columns that point to the various kinds of activity that you want to track. This is a fully 3NF way of doing things, but it still requires new columns to be added for new kinds of activities and it results in sparse columns, which people try to avoid as a rule.

  • Forrest of Tables: Create individual activity tables each of which track a particular type of activity. You can then create a query or view that uses UNION ALL to bring the various kinds of activity together into one place for querying. This is also very by the book with respect to normal forms, but you end up with a database cluttered with activity tables and you have to create a new table and alter a view whenever you add a new type of activity.

  • Data Driven Joins: Use a single activity table that captures the type of the activity and the instance of the activity in one place. You would need to have more sophisticated logic built for retrieval of the activity details, since this approach foregoes formal foreign keys. There is no declarative referential integrity and the approach smells a lot like EAV, which makes many people's skin crawl. On the other hand, you wind up with a single table, no sparse columns, and the addition of a new activity type requires no schema changes, only a tweak to one piece of code - the one that reads type of activity and knows what to do to go and retrieve the details of that activity.

You need to look at where your sensitivities are going to lie. How many types of activity will you ultimately be tracking? How often are new types of activity going to be added to the mix? Are you prepared to endure the scorn of colleagues when they see you've implemented a solution with sparse columns (or worse) EAV? When you answer these questions you'll be able to pick the approach which makes the most palatable compromises for your application.