Your abstract requirements scream "PostgreSQL" to me. However, I think it's worth staying abreast of what the bourgeoisie are up to, so here's a list of various stuff you might want to check into.
Free stuff
- CouchDB - one of the first NoSQL databases, powerful map/reduce querying system, highly distributed and fault tolerant. One of the better NoSQL contenders.
- HyperDex - very new, distributed hash table with search capabilities.
- Riak - distributed hash table worthy of some respect.
Weird free stuff
- Metakit - more of an embedded database like SQLite but not SQL-based, so more procedural.
- FramerD - much like a classic "network" database, very pointer-centric. Perhaps dead?
- Magma - Smalltalk OODBMS. Cool but not well documented.
Non-free stuff
- AllegroGraph - RDF (graph) database, supports SPARQL. Lisp-flavored.
- Caché - a hybrid relational/OO database, originally based on MUMPS (IIRC).
- Objectivity - One of the last few really big OODBs. Very powerful, impressive and expensive.
- VoltDB - Highly scalable mostly relational database. Supports "most" SQL. Very new. I guess they have a community version too.
Conclusion
I have not used any of these things extensively. I have played with most of them a little bit and always wound up back with PostgreSQL. Looking at your requirements, the only one PostgreSQL doesn't meet out of the box is scalability. On the other hand, for my purposes it's much easier to throw $4000 of hardware at a single dedicated database machine than to throw $4000 of cloud nodes or low-end machines at this problem. And there are ways of achieving scalability with PostgreSQL, such as with EnterpriseDB.
It's great fun to play around with these things on the side, but when it comes time to put valuable, irreproducible production data into something, a bunch of boring attributes like reliability, stability and long-term viability wind up coming to the fore.
Thought experiment for you
Consider this: imagine you're Mark Zuckerberg, and you have to choose between giving up your codebase or giving up your data. You keep all your development staff either way. In the first case, all your code is gone, every line of it, and even the developers' memories of how they implemented everything are wiped, but you keep all your user accounts, all your users' uploaded data, and all the rest. In the second case, you keep all the structures, the servers, the configuration, the whole setup, but you lose every row in every table in every database.
It should be obvious that losing the data would be worse. Why would all your users regenerate all that data? Think of all the marketing data lost, which is how Facebook actually makes its money. And there are tons of entrepreneurs salivating at the opportunity to get people to use their Facebook clones; all those disenfranchised ex-Facebook users would suddenly be out there considering alternatives. On the other hand, if Facebook lost the codebase, they could rebuild it, probably even better than it is now, and they could have something online in very short order. Heck, they could probably buy someone else's Facebook-clone codebase and load it up with the real data, but a competitor can't just copy their data. If Facebook still has everyone's important data on its servers, the incentive to leave is much lower. Still bad, but much less so. Surprisingly less so.
The irony is that it is much easier to lose all your data in a freak accident than to lose all your code. For most internet companies, though, the data is the company; it is their most valuable asset. And this is a strong reason to consider using a traditional, time-tested, old-fashioned, unsexy relational database.
It looks like it is probably following an index on CreatedDate in order from lowest to highest and doing lookups to evaluate the SomeIndexedValue = 1 predicate. When it finds the first matching row it is done, but it may well be doing many more lookups than it expects before it finds such a row (it assumes the rows matching the predicate are randomly distributed with respect to date).
See my answer here for a similar issue.
The ideal index for this query would be one on SomeIndexedValue, CreatedDate. Assuming that you can't add that, or at least make your existing index on SomeIndexedValue cover CreatedDate as an included column, then you could try rewriting the query as follows
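-- The idea: DATEADD(DAY, 0, CreatedDate) returns the same value, but wrapping the column
-- in an expression should stop the optimizer from answering MIN() with an ordered scan of
-- the CreatedDate index.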
SELECT MIN(DATEADD(DAY, 0, CreatedDate)) AS CreatedDate
FROM MyTable
WHERE SomeIndexedValue = 1
to prevent it from using that particular plan.
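If you are able to change the indexing instead, the ideal index described above might look something like this. This is only a sketch using the names from your query; the index names themselves are made up:

CREATE NONCLUSTERED INDEX IX_MyTable_SomeIndexedValue_CreatedDate
ON MyTable (SomeIndexedValue, CreatedDate);

-- or, if you would rather widen the existing index on SomeIndexedValue so it covers the query:
CREATE NONCLUSTERED INDEX IX_MyTable_SomeIndexedValue
ON MyTable (SomeIndexedValue)
INCLUDE (CreatedDate);

With the composite index, MIN(CreatedDate) for SomeIndexedValue = 1 can be answered with a single seek to the first matching row.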
Best Answer
What would address your question is the subject of JOIN DECOMPOSITION.
According to page 209 of the book:
You can decompose a join by running multiple single-table queries instead of a multitable join, and then performing the join in the application. For example, instead of running one query that joins several tables, you might run one query per table and combine the results in your code.
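To make the idea concrete, here is a rough sketch of such a decomposition; the table names (tag, tag_post, post) and the literal IDs are illustrative only, chosen to match the cache discussion below:

-- one multitable join:
SELECT * FROM tag
JOIN tag_post ON tag_post.tag_id = tag.id
JOIN post ON tag_post.post_id = post.id
WHERE tag.tag = 'mysql';

-- decomposed into single-table queries; the application then joins the result sets:
SELECT * FROM tag WHERE tag = 'mysql';
SELECT * FROM tag_post WHERE tag_id = 1234;
SELECT * FROM post WHERE id IN (123, 567, 908);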
Why on earth would you do this? It looks wasteful at first glance, because you've increased the number of queries without getting anything in return. However, such restructuring can actually give significant performance advantages:
- If the tag mysql is already cached, the application will skip the first query. If you find posts with an ID of 123, 567, or 908 in the cache, you can remove them from the IN() list. The query cache might also benefit from this strategy. If only one of the tables changes frequently, decomposing a join can reduce the number of cache invalidations.
- Using an IN() list instead of a join lets MySQL sort row IDs and retrieve rows more optimally than might be possible with a join.
- As a result, doing joins in the application can be more efficient when you cache and reuse a lot of data from earlier queries, you distribute data across multiple servers, you replace joins with IN() lists, or a join refers to the same table multiple times.
OBSERVATION
I like the first bulletpoint because InnoDB is a little heavy-handed when it crosschecks the query cache.
- Sep 05, 2012: Is the overhead of frequent query cache invalidation ever worth it?
- Jun 07, 2014: Why query_cache_type is disabled by default start from MySQL 5.6?
As for the last bulletpoint, I wrote a post back on Mar 11, 2013 (Is there an execution difference between a JOIN condition and a WHERE condition?) that describes the nested loop algorithm. After reading it, you will see how good join decomposition may be.
As for all the other points from the book, developers really look at performance as the bottom line. Some rely on external means (outside of the application) for performance enhancements, such as using faster disks, adding more CPUs/cores, tuning the storage engine, and tuning the configuration file. Others will buckle down and write better code. Some may resort to coding all the business intelligence in stored procedures but still not apply join decomposition (see What are the arguments against or for putting application logic in the database layer? along with the other posts). It all comes down to the culture and tolerance of each developer shop.
Some may be satisfied with performance and never touch the code again. Others simply don't realize there are great benefits to reap if they try join decomposition.
For those developers that are willing ...
GIVE IT A TRY !!!