Your abstract requirements scream "PostgreSQL" to me. However, I think it's worth staying abreast of what the bourgeoisie are up to, so here's a list of various stuff you might want to check into.
Free stuff
- CouchDB - one of the first NoSQL databases, powerful map/reduce querying system, highly distributed and fault tolerant. One of the better NoSQL contenders.
- Hyperdex - very new, distributed hash table with search capabilities.
- Riak - distributed hash table worthy of some respect.
Weird free stuff
- Metakit - more of an embedded database like SQLite but not SQL-based, so more procedural.
- FramerD - much like a classic "network" database, very pointer-centric. Perhaps dead?
- Magma - Smalltalk OODBMS. Cool but not well documented.
Non-free stuff
- AllegroGraph - RDF (graph) database, supports SPARQL. Lisp-flavored.
- Caché - a hybrid relational/OO database, originally based on MUMPS (IIRC).
- Objectivity - One of the last few really big OODBs. Very powerful, impressive and expensive.
- VoltDB - Highly scalable mostly relational database. Supports "most" SQL. Very new. I guess they have a community version too.
Conclusion
I have not used any of these things extensively. I have played with most of them a little bit and always wound up back with PostgreSQL. Looking at your requirements, the only one PostgreSQL doesn't meet out of the box is scalability. On the other hand, for my purposes it's much easier to throw $4000 of hardware at a single dedicated database machine than to throw $4000 of cloud nodes or low-end machines at this problem. And there are ways of achieving scalability with PostgreSQL, such as with EnterpriseDB.
It's great fun to play around with these things on the side, but when it comes time to put valuable, irreproducible production data into something, a bunch of boring attributes like reliability, stability and long-term viability wind up coming to the fore.
Thought experiment for you
Consider this. Imagine you're Mark Zuckerberg, and you have to choose either to give up your codebase or to give up your data. You keep all your development staff either way. In the first case you lose all your code: every line, and even the developers' memories of how they implemented everything, is gone, but you keep all your user accounts, all your users' uploaded data, and all that. In the second case you keep the structures, the servers, the configuration, the whole setup, but you lose every row in every table in every database.
It should be obvious that it would be worse to lose the data. Why would all your users bother to regenerate all that data? Think of all the marketing data lost, which is how Facebook actually makes its money. And there are tons of entrepreneurs salivating at the opportunity to get people to use their Facebook clone; all those disenfranchised ex-Facebook users would suddenly be out there considering alternatives. On the other hand, if Facebook lost the codebase, they could rebuild it, probably even better than it is now, and they could have something online in very short order. Heck, they could probably buy someone else's Facebook clone codebase and load it up with the real data, but a competitor can't just copy Facebook's data. If Facebook still has everyone's important data on their servers, the incentive to leave is much lower. Still bad, but much less so. Surprisingly less so.
The irony is that it is much easier to lose all your data in a freak accident than to lose all your code. For most internet companies, though, the data is the company, it is your most valuable asset. And this is a strong reason to consider using a traditional, time-tested, old-fashioned, unsexy relational database.
Option 2 is known as "EAV" or Entity-Attribute-Value
- not relational
- no DB level constraints
- requires contortions to read the data unless a simple list
But it depends on what you mean by "settings". If you have a few thousand rows that are not objects and don't require constraints, then yes, use this pattern; it is what SQL Server does with sys.configurations.
If you are trying to have a "flexible schema" that can store anything, then simply don't: it will end in tears. Also see the EAV questions here on DBA.SE.
Note that "extra columns" (option 1) allows you to define defaults and datatype-safety, whereas "lack of row" (option 2) requires a default value to be stored in code and everything is a string in the database
"It depends"
Will you be INSERTing one row per click? Or just incrementing a counter?

- INSERTing or UPDATEing as many as 100K rows in one second requires some special planning. First, plan on gathering a bunch of rows, then batching the insert or update; do not try to insert/update one click at a time. (More on this in a minute; see the sketch below.)
- Build and maintain "summary tables" so that the queries will not have such a significant impact on the system. (More to come.)
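As a rough sketch of the batching idea (the clicks tables and their columns here are hypothetical, not from the question): gather rows on the client and flush them in one multi-row statement, or pre-tally and send one counter update per ad per flush.

```sql
-- One statement per batch, not per click.  Flush a few hundred or a few
-- thousand buffered clicks at a time:
INSERT INTO clicks (ad_id, clicked_at)
VALUES
    (101, '2016-05-01 12:00:00'),
    (101, '2016-05-01 12:00:01'),
    -- ... a few hundred more rows in the same statement ...
    (207, '2016-05-01 12:00:01');

-- Or, if a counter is all you need, tally in the client and send one
-- increment per ad per flush (IODKU):
INSERT INTO click_counts (ad_id, hits)
VALUES (101, 2), (207, 1)
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);
```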
I would suggest a Galera Cluster (MariaDB 10 / PXC / Mysql+Galera) to give you some scaling, HA, etc. Since you have a large read requirement, you might need multiple slaves hanging off each cluster node.
Backups could be handled by taking a slave offline. (There are other options; I don't see it as much of a problem.)
You have not explained the 85% reads enough for me to finish this discussion; I will focus on the writes and some of the reads.
Use SSDs of sufficient size. Have sufficient network capacity.
Have enough client machines -- possibly more than MySQL servers; the number depends on several things. The design should easily grow by adding clients, so I am not too worried about how many clients until the system is being prototyped.
The best high-speed ingestion I know about is here. It allows multiple clients to insert into a temp table, then ping-pong to another table while doing the INSERTs (or UPDATEs). If all you are doing is counting hits, this design takes a load off the main table because it would do some of the tallying in the tmp table. Also, you would probably use INSERT ... ON DUPLICATE KEY UPDATE ... SELECT ... to move the data from the temp table to the real table. (Caveat: The blog was written with a single Master in mind; I need to do some research to tailor it to Galera, especially if this is happening on all 3 nodes. Initial thoughts suggest that only minor tweaks are needed.)

Normalization can, and should, be done from the tmp table. (The blog discusses that.)
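Here is a hedged sketch of that flip-and-flush cycle; the table names, schema, and exact statements are my own illustration of the idea, not the blog's code.

```sql
-- Clients INSERT into `staging` continuously.  To start a flush cycle,
-- atomically swap in an empty twin so the clients never block:
RENAME TABLE staging TO staging_flip, staging_spare TO staging;

-- If you keep raw rows, move them to the real table in one batched statement:
INSERT INTO clicks (ad_id, clicked_at)
    SELECT ad_id, clicked_at FROM staging_flip;

-- If you are only counting hits, tally in the flipped table and fold the
-- subtotals into the real table with IODKU (assumes a unique key on ad_id):
INSERT INTO clicks_per_ad (ad_id, hits)
    SELECT ad_id, COUNT(*) FROM staging_flip GROUP BY ad_id
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);

-- Empty the flipped table; it becomes the spare for the next cycle.
TRUNCATE TABLE staging_flip;
RENAME TABLE staging_flip TO staging_spare;
```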
With Summary table(s), the 85% estimate for reads may drop significantly. The idea is to create and maintain tables that are subtotals over an hour or day. Then the queries actually hit the summary tables and run a lot faster than against the real table. (Sometimes 10x faster.) The summary table(s) may be denormalized for extra convenience/speed. The Summary table updates are likely to be IODKU from the tmp table.
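As an illustration of the summary-table idea (again, the schema is hypothetical, and it reuses the staging_flip table from the sketch above), an hourly rollup might look like this, maintained by IODKU during each flush and queried directly by the reports:

```sql
CREATE TABLE clicks_hourly (
    ad_id INT      NOT NULL,
    hr    DATETIME NOT NULL,          -- click time truncated to the hour
    hits  INT      NOT NULL,
    PRIMARY KEY (ad_id, hr)
);

-- Maintained from the tmp/staging table during each flush cycle:
INSERT INTO clicks_hourly (ad_id, hr, hits)
    SELECT ad_id,
           DATE_FORMAT(clicked_at, '%Y-%m-%d %H:00:00') AS hr,
           COUNT(*)
    FROM staging_flip
    GROUP BY ad_id, hr
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);

-- Reports scan a few summary rows instead of millions of raw clicks:
SELECT ad_id, SUM(hits) AS hits
FROM clicks_hourly
WHERE hr >= '2016-05-01' AND hr < '2016-05-02'
GROUP BY ad_id;
```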
I need to see more details about the schema and the tentative inserts/updates and selects. This entire discussion will probably expand beyond what can/should be handled in a free forum.
Reference: Summary table blog