Postgresql – the modern way to partition PostgreSQL across machines, when the data is “naturally partitionable”

high-availabilitypartitioningpostgresql

After several years of dwelling into the "NoSQL" space, now I have a problem that is quite "relational" in its nature.
Today I see data stores with quite different eyes than before. Things like Riak have spoiled me in a way that I can no more tolerate single points of failure, "down for maintenance" etc. Of course, (or I hope), I havent' lost my sanity totally. This is a a personal project that doesn't quite (or yet) have extremely high requirements.

Most of the sharding solutions don't give me what I want (at least on a glimpse), probably because my problem is quite "easy" to solve. At least on conceptual level (ignoring the restraints that RDBMs themselves bring to the table).

  1. I have a small amount of "shared" data, which can be duplicated freely. It doesn't have requirements of hard consistency. This can be stored in a dynamo-like database and will scale infinitely. But I still would like to go with a single database if possible.

  2. I have lots of "per-user" data. That is – lots of users, with every user having data of absolutely reasonable size, really fit to be stored on a single PostgreSQL node. We are talking about 10s of thousands records at maximum.

  3. I never need to query cross-user and I don't need cross-user atomicity.

This sounds extremely easy to achieve. At least when I'm looking at it with my "NoSQL eyes".

Here are my naive starter ideas:

  1. At the very extreme, I could just serialize the whole user as a single Key/Value in Riak. Of course, constant de/serialization of several megabytes of data will be slow and that's why I'm considering using PostgreSQL. Lots of Riak K/Vs is a no-go, as I need atomicity/transactions within each user's data.

  2. I could use an SQLite database per user, and use something like GlusterFS for the redundancy/availability. This is probably the solution I'm going to choose if I can't find something equally good using PostgreSQL. Pros: Can down/up scale really well; Cons: I'd prefer having PostgreSQL's types and strictness over SQLite

So, what I would ideally request from a PostgreSQL sharding solution:

  1. Automatically keep several copies of every user's data around (on different machines). Be able to dynamically switch the master node per user/shard (if the previous master goes down).
  2. Be able to dynamically up/down scale, by adding/removing server nodes. Mostly like Riak is able to do.
  3. Do not require my application to know which nodes to talk to and when.

Best Answer

Postgres-XL is attempting to solve this as of 2014. They're aiming directly at big data on PostgreSQL, and they have developers from Stado onboard.