PostgreSQL – Load Balancing / MPP with Postgres 9.1

postgresql

I'm trying to set up a testbed for big-data analytics (a base of roughly 2 to 3 TB, light writes [only feeding the DB], but possibly read-intensive queries for analytics purposes).

Since I'm somewhat experienced with vanilla PostgreSQL (but only with ~10 GB bases), I was considering trying the new synchronous replication in 9.1 to achieve load balancing. The objective is to make Tableau very responsive. That way I could avoid the costly MPP products: Greenplum, Aster Data and others.

However, I was wondering:
– is going straight to an MPP database, Greenplum for example, a good idea? (For 2-3 TB, is Greenplum really needed?)
– once synchronous replication is up and running, how should I configure my applications? Is the load balancing done by the master, or is another solution needed (pgpool, for example)?

Thank you for your insights!

Best Answer

In principle, 2-3 TB should be feasible without resorting to a shared-nothing architecture, but vanilla PostgreSQL still does not have good parallel query functionality. All replication will give you is the ability to apportion whole queries out to individual nodes. I don't believe PostgreSQL supports federated queries out of the box, and I would be very surprised to find that Tableau has direct support for sharding at the client. I'd guess that PostgreSQL won't perform very well on a data set this big.
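To answer the configuration question directly: the master does not balance reads for you; something in front of the cluster has to route queries. A minimal sketch of how that routing is usually done with pgpool-II over 9.1 streaming/synchronous replication follows (host names, the standby's application_name, and the backend weights here are placeholders, and the exact parameter set varies by pgpool-II version):

    # postgresql.conf on the 9.1 master
    wal_level = hot_standby
    max_wal_senders = 3
    synchronous_standby_names = 'standby1'   # commits wait for this standby to confirm

    # postgresql.conf on the standby
    hot_standby = on                         # allow read-only queries during recovery

    # recovery.conf on the standby
    # application_name must match synchronous_standby_names above
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com port=5432 application_name=standby1'

    # pgpool.conf -- pgpool-II sits in front and spreads SELECTs across the nodes
    master_slave_mode = on
    master_slave_sub_mode = 'stream'         # streaming-replication aware
    load_balance_mode = on
    backend_hostname0 = 'master.example.com'
    backend_port0 = 5432
    backend_weight0 = 1
    backend_hostname1 = 'standby.example.com'
    backend_port1 = 5432
    backend_weight1 = 1

The application (or Tableau) then connects to pgpool's port rather than to PostgreSQL directly; writes go to the master and read-only queries are spread across both nodes. Note that each query still runs in full on a single node, which is exactly the limitation described above.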

There is work in progress on a parallel query facility for PostgreSQL, but AFAIK it's not included in 9.1. My gut instinct is that some release in the next few years will include this feature, but it's not quite there yet. I don't see evidence of much effort being put into a federated query facility.

Another option: SQL Server

Unless you're married to Postgres, you might find that SQL Server provides a cost-effective alternative to Greenplum for a 2-3 TB data set. It is licensed by the socket rather than the core, so a loaded 2-4 socket Xeon or Opteron box gives good bang for buck as a platform. For a smaller user base, I believe you can still license Enterprise Edition by CAL as well.

A couple of 24-25 drive arrays on a high-end RAID controller will do sequential reads fast enough to saturate a PCIe x8 slot (about 2 GB/sec). A simple table-scan query in SQL Server will handle data at this rate without using much CPU (obviously depending on the actual computation), so you have some headroom to add controllers and arrays if you want faster I/O.
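As a rough sanity check, assuming ~85 MB/sec of sustained sequential throughput per drive: 24 drives gives 24 × 85 MB/sec ≈ 2 GB/sec, which is roughly the usable bandwidth of a PCIe 1.x x8 link (250 MB/sec per lane, per direction).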

SQL Server also comes with a passably good set of B.I. tooling, including an OLAP server. Tableau is not cheap, around £1,800/seat last time I looked. Depending on the number of users, you may find that the 'good enough' tooling that comes with SQL Server offsets the cost of the DB licence anyway. Most third-party reporting tools also play nicely with SQL Server.

Disclaimer: I'm not any sort of die-hard Microsoftie, but I have done a lot of B.I. work with SQL Server and Oracle over the past decade or so. SQL Server is actually quite a good B.I. platform.