What “scale” of applications does NoSQL support?

database-theory nosql scalability

Recently I came across the NoSQL database concept. Although I have learned the how and why of it, I have not found a proper answer to this question: what scale of project does it support?

Will it support larger applications that make heavy use of statistics, such as Google Analytics (as an example)?
Are there solid datasets or applications already running on such databases, for example Couchbase?

Best Answer

This question is really far too vague to answer effectively. There are dozens of "NoSQL" data stores out there, each with its own use cases. Here is a 10,000-foot view of what's out there.

In my mind, there are basically 3 main categories of NoSQL data stores in common use: key/value stores, document databases, and big data (Hadoop). This is a somewhat artificial designation, and many of these products can arguably cross into multiple areas. There are some other categories, such as graph databases, which are more specialized towards a specific problem; I am not going to discuss them here as I have no expertise in them.

Most NoSQL databases are simple key/value stores, which are very fast at retrieving named keys. They are particularly inefficient if you need to scan or aggregate over sets of data. Examples of k/v stores are memcached, Riak, Redis, Couchbase, Voldemort, and Amazon DynamoDB. With the HandlerSocket plugin (built into Percona Server), even MySQL can be used as a very fast k/v store. Each of these k/v stores has a different feature set designed to solve different problems. Very few of them are suitable as the authoritative/primary data store for an application because of how difficult and inefficient it is to perform set operations. They are mostly used as caching layers or for storing very specialized data that does not require relational operations.
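To make the lookup-versus-aggregate difference concrete, here is a minimal sketch in Python. A plain dict stands in for the k/v store; the key names, the record fields, and both helper functions are made up for illustration and do not correspond to any particular product's API.

```python
# A plain dict stands in for a key/value store (an assumption for
# illustration): the store only knows keys, the values are opaque to it.
store = {
    "user:1": {"name": "alice", "country": "DE", "visits": 12},
    "user:2": {"name": "bob", "country": "US", "visits": 7},
    "user:3": {"name": "carol", "country": "DE", "visits": 30},
}


def get(key):
    """Named-key lookup: a single hash probe, regardless of store size."""
    return store.get(key)


def total_visits_by_country(country):
    """Aggregate over a field the store has no index on.

    Every value has to be fetched and inspected, so the cost grows
    with the total number of keys, not with the number of matches.
    Returns (total visits, number of values scanned).
    """
    scanned = 0
    total = 0
    for value in store.values():
        scanned += 1
        if value["country"] == country:
            total += value["visits"]
    return total, scanned


print(get("user:2"))
print(total_visits_by_country("DE"))
```

The lookup touches one entry; the aggregate touches all three even though only two match. In a real distributed k/v store that full scan would also cross the network to every node.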

Another general class of NoSQL databases is document stores. Examples include MongoDB and, loosely, Cassandra (which is more precisely a wide-column store). These data stores hold more structured data than k/v stores and often have a more capable query language. They have flexible "schemas" that make it possible to keep completely different sets of fields from one row to the next.
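As a rough illustration of what a flexible "schema" means, the sketch below keeps differently shaped records in one collection and queries them by field. It uses plain Python dicts and a hypothetical `find` helper; it is not the query API of MongoDB or any other product.

```python
# Each document carries its own set of fields; nothing forces the
# second document to have a "url" or the first to have an "email".
documents = [
    {"_id": 1, "type": "page_view", "url": "/home", "ms": 120},
    {"_id": 2, "type": "signup", "email": "a@example.com"},
    {"_id": 3, "type": "page_view", "url": "/pricing", "ms": 340, "referrer": "ad"},
]

_MISSING = object()  # sentinel so that a stored None is not confused with "absent"


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks one of the fields simply does not match;
    it is not an error, which is the point of a flexible schema.
    """
    return [
        doc
        for doc in collection
        if all(doc.get(field, _MISSING) == value for field, value in criteria.items())
    ]


print([doc["_id"] for doc in find(documents, type="page_view")])
print([doc["_id"] for doc in find(documents, referrer="ad")])
```

Unlike the opaque values of a pure k/v store, the store can look inside each document here, which is what makes a richer query language possible.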

Finally, you get to the true "Big Data" stores, of which Hadoop and its related tools, Pig, Hive (a SQL interface to Hadoop), and HBase (a real-time data store on top of Hadoop/HDFS), are king. With the exception of HBase, Hadoop-based data stores tend to be built for offline processing of truly enormous data sets across hundreds of machines.

As a side note, what drives me absolutely batty about "NoSQL" is that it has literally nothing to do with the SQL language. NoSQL is about reinventing the data storage layer and making it more "scalable" (another vague, misunderstood term) and highly available. The query language is irrelevant in most cases, and some of these data stores have produced just horribly ugly ways of performing even the simplest operations. SQL could be used as the access language to most of these data stores had the developers made that choice - take a look at VoltDB, MySQL Cluster, or Hive for examples of distributed SQL databases that have "NoSQL" features. When treated like a key/value store, MySQL with InnoDB is actually incredibly fast at primary key lookups (SELECT value FROM table WHERE key = ?), and it would be relatively easy to write a client library that uses a consistent hashing scheme to build a distributed MySQL cluster, used the same way one would use Riak, Redis, or memcached.
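For the consistent hashing idea mentioned above, a minimal client-side sketch might look like the following. The `HashRing` class, the node names (`db1`, `db2`, `db3`), and the choice of 100 virtual points per node are all assumptions for illustration; the sketch only decides which server a key belongs to and leaves the actual MySQL connection and the primary key lookup out.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self._replicas = replicas  # virtual points per physical node
        self._ring = []            # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # Any stable hash works; Python's built-in hash() is not stable
        # across processes for strings, so use hashlib instead.
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        """Place `replicas` points for this node on the ring."""
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        """Drop all points belonging to this node."""
        self._ring = [(point, n) for point, n in self._ring if n != node]

    def node_for(self, key):
        """Return the node owning the first point at or after the key's hash."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around to the start of the ring
        return self._ring[idx][1]


ring = HashRing(["db1", "db2", "db3"])
print(ring.node_for("user:42"))
```

The property that makes this attractive for a sharded cluster is that removing a server only remaps the keys that lived on that server; keys on the remaining servers stay where they were, so their data does not have to move.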

The bottom line is that you'll have to be more explicit about your needs if you want any more detail than that. Here is just a small subset of the questions that you will need to answer in order to even narrow the field:

  • Is your access pattern realtime (OLTP) or will it be performed in batch operations (OLAP)?
  • Do you need to perform aggregate or set-based calculations on the data, or is it simply accessing keys by name?
  • How much data do you have and how is it structured?
  • Have you determined that a traditional SQL database will not suit your needs?
  • What are your CAP priorities?
  • Do you require ACID features?
  • What kind of operations do you need to perform on the data?

I hope that this helps you a bit in your research.