Seriously look at third normal form. I would use surrogate keys with the natural keys implemented as unique keys. You will likely find that author belongs in its own authors table. You may find you have a few tables which are quite similar such as user_content_faves, user_author_faves, user_author_shares. This is normal.
Having a single content table with a content_type column may be appropriate. The content column would need to be capable of storing all the content types.
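To make the above concrete, here is a minimal sketch (using SQLite for illustration; all table and column names are assumptions, not from the question) of surrogate keys with the natural key kept unique, authors split into their own table, and a single content table with a content_type column:

```python
import sqlite3

# Hypothetical schema sketch: surrogate keys, natural keys as unique
# constraints, authors in their own table. Names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (
    author_id   INTEGER PRIMARY KEY,   -- surrogate key
    author_name TEXT NOT NULL UNIQUE   -- natural key kept unique
);
CREATE TABLE content (
    content_id   INTEGER PRIMARY KEY,
    author_id    INTEGER NOT NULL REFERENCES authors(author_id),
    content_type TEXT NOT NULL,        -- e.g. 'article', 'photo', 'video'
    content      BLOB NOT NULL         -- must be able to hold every type
);
""")
conn.execute("INSERT INTO authors (author_name) VALUES ('alice')")
conn.execute(
    "INSERT INTO content (author_id, content_type, content) "
    "VALUES (1, 'article', 'hello')"
)
row = conn.execute(
    "SELECT a.author_name, c.content_type "
    "FROM content c JOIN authors a USING (author_id)"
).fetchone()
print(row)  # ('alice', 'article')
```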
EDIT: For relationship tables I usually name the join table by concatenating the names of the joined tables, abbreviating as necessary. If there are multiple relationships between the two tables, I use one of two options:
- Append the purpose of the relationship (as I did above); or
- Add a type/reason code to the relationship (in which case the type is not needed in the relationship name).
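The two naming options above can be sketched side by side (SQLite syntax; table names are illustrative):

```python
import sqlite3

# Join tables named by concatenating the joined tables (users + content).
# Option 1 appends the purpose to the name; option 2 uses one table with
# a type code. Names are hypothetical, not from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, user_name TEXT UNIQUE);
CREATE TABLE content (content_id INTEGER PRIMARY KEY, title TEXT);

-- Option 1: purpose appended to the relationship name (..._faves)
CREATE TABLE user_content_faves (
    user_id    INTEGER REFERENCES users(user_id),
    content_id INTEGER REFERENCES content(content_id),
    PRIMARY KEY (user_id, content_id)
);

-- Option 2: a type/reason code, so the type is not in the table name
CREATE TABLE user_content_rel (
    user_id    INTEGER REFERENCES users(user_id),
    content_id INTEGER REFERENCES content(content_id),
    rel_type   TEXT NOT NULL,   -- 'fave', 'share', ...
    PRIMARY KEY (user_id, content_id, rel_type)
);
""")
names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(names)
```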
I made the assumption that you would want to track who favorited or shared things. It appears that both the producer (author or user) and the content items can be favorited. Therefore you have users favoriting a producer (user_author_faves, now user_user_faves) or a product (user_content_faves). Depending on how you do sharing,
- it could be an attribute on the content, or
- a relationship like user_content_shares, where content is shared with a particular user. Re-sharing could be problematic if you track who shared things and multiple users share the same content to the same user; conversely, un-sharing re-shared items is problematic if you don't track who did the sharing.
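One way to sidestep the re-sharing problem is to include the sharer in the key, so each share is a distinct row and un-sharing one leaves the others intact. A minimal sketch (column names are assumptions):

```python
import sqlite3

# Hypothetical user_content_shares table that records who did the sharing.
# Two users sharing the same content with the same recipient are two rows,
# so un-sharing by one does not destroy the other's share.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_content_shares (
    shared_with INTEGER NOT NULL,   -- user receiving the content
    content_id  INTEGER NOT NULL,
    shared_by   INTEGER NOT NULL,   -- user who shared it
    PRIMARY KEY (shared_with, content_id, shared_by)
);
""")
# Users 1 and 2 both share content 42 with user 7: two distinct rows.
conn.execute("INSERT INTO user_content_shares VALUES (7, 42, 1)")
conn.execute("INSERT INTO user_content_shares VALUES (7, 42, 2)")
# User 1 un-shares; user 2's share survives.
conn.execute(
    "DELETE FROM user_content_shares "
    "WHERE shared_with = 7 AND content_id = 42 AND shared_by = 1"
)
n = conn.execute("SELECT COUNT(*) FROM user_content_shares").fetchone()[0]
print(n)  # 1
```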
You may want to consider (and set policies for):
- tracking/audit information, like when something was done (added, favorited, shared, etc.).
- whether to do physical or logical deletes.
- if you do logical deletes, how to handle re-favoriting or re-sharing something after a logical delete.
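These three policies can be combined in one sketch: a created_at audit column, a deleted_at column for logical deletes, and re-favoriting handled by reviving the logically deleted row rather than inserting a duplicate. All names are illustrative:

```python
import sqlite3

# Hypothetical faves table with audit and logical-delete columns.
# deleted_at IS NULL means the row is active.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_content_faves (
    user_id    INTEGER NOT NULL,
    content_id INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    deleted_at TEXT,
    PRIMARY KEY (user_id, content_id)
);
""")
conn.execute(
    "INSERT INTO user_content_faves (user_id, content_id) VALUES (1, 42)")
# Logical delete: stamp deleted_at instead of removing the row.
conn.execute(
    "UPDATE user_content_faves SET deleted_at = datetime('now') "
    "WHERE user_id = 1 AND content_id = 42")
# Re-favorite: revive the row, avoiding a primary-key violation.
conn.execute(
    "UPDATE user_content_faves SET deleted_at = NULL "
    "WHERE user_id = 1 AND content_id = 42")
active = conn.execute(
    "SELECT COUNT(*) FROM user_content_faves WHERE deleted_at IS NULL"
).fetchone()[0]
print(active)  # 1
```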
When indexing relationship tables I generally make the primary key consist of the primary keys of the two tables being joined. A second index, either with the key columns reversed or on just the column that appears second in the primary key, is usually required. If the relationship between two rows can occur more than once, the column(s) used to differentiate the reason/type and/or timing (since date) of the relationship need to be added to the primary key.
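This indexing pattern can be verified directly (SQLite shown; names are illustrative). The composite primary key serves lookups by user, while the reversed index serves "who favorited this content?" queries:

```python
import sqlite3

# Relationship table with a composite primary key, plus a since_date
# differentiator because the same pair can relate more than once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_content_faves (
    user_id    INTEGER NOT NULL,
    content_id INTEGER NOT NULL,
    since_date TEXT NOT NULL,
    PRIMARY KEY (user_id, content_id, since_date)
);
-- Reversed index: serves queries that start from the content side.
CREATE INDEX ix_faves_content ON user_content_faves (content_id, user_id);
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT user_id FROM user_content_faves WHERE content_id = 42"
).fetchall()
uses_index = any("ix_faves_content" in str(r) for r in plan)
print(uses_index)  # True
```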
In my opinion, the key differentiator of true SOA systems (over the pseudo-SOA, n-tier/distributed systems that are becoming ubiquitous) is that there should be zero interaction between discrete services. Where this is achieved, any application you compose from these services can and should be built to tolerate the failure of any constituent part. A failure reduces functionality, but service is maintained.
In this scenario it's logical, or required, to separate the underlying database for each service. If however you have services which are interdependent, there is little (perhaps nothing) to be gained from a split.
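The "tolerate the failure of any constituent part" idea boils down to the composing application degrading gracefully when one service is down. A minimal sketch (the service names and fallback behaviour are hypothetical):

```python
# Hypothetical composing application: if the recommendations service is
# unreachable, the page still renders with reduced functionality.

def fetch_recommendations(user_id):
    # Stand-in for a call to an independent recommendations service;
    # here it simulates that service being down.
    raise ConnectionError("recommendations service unavailable")

def render_home_page(user_id):
    page = {"user": user_id, "content": ["item-1", "item-2"]}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except ConnectionError:
        # Reduced functionality, but service is maintained.
        page["recommendations"] = []
    return page

print(render_home_page(7)["recommendations"])  # []
```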
I'd recommend reading sites such as HighScalability.com, which dig into the architectures adopted by the never-fail type of website. One of my favourites of late was the story of the Netflix Chaos Monkey, which was mentioned on Coding Horror.
Addressing a couple of the points in your question:
> In the event of a disaster, restoring the platform to a consistent state is easier.
This is true, but you should perhaps be thinking about how to better decouple these services so this stops being an issue. Alternatively, there are methods to ensure synchronisation across multiple databases (marked transactions in SQL Server, for example).
> For data that is referenced by multiple services, data cached by one service is likely to be used soon after by another service.
Distributed cache solutions (memcached et al) could help here, but you'd be violating the service-independence principles. This would be comparable to having two services communicating with each other directly, or worse, having a service access another's data store, bypassing the service interface altogether. Inevitably data will be related and will be handed between services by the calling platform; the tricky decisions tend to be around which service will own which pieces of data. The StackOverflow or Programmers sites might be better placed to help with the more general SOA issues.
> Assuming each database is on separate hardware, scaling up yields more performance benefits.
Certainly it can be cheaper to scale out across multiple lower-spec machines than to scale up a single machine, although the lower hardware costs may be dwarfed in the total cost of ownership once the soft costs of additional development effort and operational complexity are factored in.
If this isn't SOA and you just have a case where the component services of this platform are being built by different teams/suppliers for logistical reasons, stick with a single database and completely ignore everything above! :)
The 'cleaner' solution would definitely be to use a single schema and relate your data to a specific event, either with an EventID column or, better yet, an Events table, relating your data to a row in that table (but keep it simple: you don't need to relate every row in every table to your event, just the 'top-most' objects).
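A sketch of the Events-table approach, with only the top-most objects carrying an event_id while child rows hang off them (table and column names are illustrative, not from the question):

```python
import sqlite3

# Hypothetical schema: one events row per year; only the top-most object
# (registrations here) references it, child rows reference registrations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    event_id   INTEGER PRIMARY KEY,
    event_year INTEGER NOT NULL UNIQUE
);
CREATE TABLE registrations (
    registration_id INTEGER PRIMARY KEY,
    event_id        INTEGER NOT NULL REFERENCES events(event_id),
    attendee        TEXT NOT NULL
);
-- Child rows need no event_id of their own.
CREATE TABLE registration_items (
    item_id         INTEGER PRIMARY KEY,
    registration_id INTEGER NOT NULL REFERENCES registrations(registration_id),
    item            TEXT NOT NULL
);
""")
conn.execute("INSERT INTO events (event_year) VALUES (2023)")
conn.execute(
    "INSERT INTO registrations (event_id, attendee) VALUES (1, 'alice')")
n = conn.execute(
    "SELECT COUNT(*) FROM registrations r JOIN events e USING (event_id) "
    "WHERE e.event_year = 2023"
).fetchone()[0]
print(n)  # 1
```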
But on the other hand, maybe 'You Ain't Gonna Need It'. A lot can change in a year, and you'll probably have to make a lot of changes and improvements for next year's event. Either way, you'll find yourself carrying over and migrating data which nobody needs or wants, so you may end up purging your database after all.
I learned to stick to the YAGNI rule the hard way several times, and keeping things simple always proved to be the better solution. As long as you produce clean and maintainable code, it's often much simpler to add functionality than to trim unneeded (and often untested) fat.