PostgreSQL for high-volume transactions and for data warehousing

Tags: data-warehouse, etl, postgresql

I am quite new to PostgreSQL; I have never done a large deployment with it before. However, I have good experience with enterprise solutions, and I want to try to apply some of what I have learned to PostgreSQL.

I have a site that is sized to handle a large amount of data and traffic. The infrastructure will be built on Amazon Web Services (AWS) using EC2 instances and EBS volumes.

The design should have two databases: a main transactional database and a data warehouse to handle analysis and reporting.

Main transactional database

This will be used for the live website. The site is built on multiple nodes to scale for concurrent users. We mainly require the database in this case to be extremely fast for read operations; we expect >100 GB of data with 30% annual growth. At this point, we are planning to use two EC2 servers (and add more later as needed).

My question: what is the recommended setup for the above requirements? Also, is there a way to manage table and volume partitioning? Are there any recommendations for the AWS setup?
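For the partitioning part, this is roughly what I have in mind, as a sketch only: it assumes PostgreSQL 10+ declarative partitioning (older versions would need table inheritance plus triggers instead), and the table, column, and connection names are made up for illustration.

```python
# Sketch: monthly range partitioning for a hypothetical "orders" table,
# created through psycopg2. Requires PostgreSQL 10+ for PARTITION BY RANGE.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS orders (
    order_id    bigserial,
    customer_id bigint      NOT NULL,
    created_at  timestamptz NOT NULL,
    amount      numeric(12,2),
    PRIMARY KEY (order_id, created_at)
) PARTITION BY RANGE (created_at);

-- One partition per month; each partition could be placed in its own
-- tablespace, i.e. on a separate EBS volume, for "volume partitioning".
CREATE TABLE IF NOT EXISTS orders_2014_01 PARTITION OF orders
    FOR VALUES FROM ('2014-01-01') TO ('2014-02-01');
CREATE TABLE IF NOT EXISTS orders_2014_02 PARTITION OF orders
    FOR VALUES FROM ('2014-02-01') TO ('2014-03-01');
"""

conn = psycopg2.connect("dbname=maindb user=app host=127.0.0.1")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```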

Data warehouse database

This will be used mainly for capturing all the data from the main transactional database along a time dimension, so even records deleted from the main database will still be present in the DWH. Therefore, the data will be very large and its growth will be even bigger. We will also use a couple of EC2 instances, or more if required.

What is the recommended setup in this case? It will require fast write operations because of the constant writing (ETL). Can we build OLAP cubes in PostgreSQL? If so, has anyone out there given it a try?
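To make the cube question concrete, the fallback I would consider is plain summary tables rather than a real OLAP engine; a sketch along these lines (the fact/dimension names are invented, and it assumes PostgreSQL 9.5+ for ROLLUP and 9.3+ for materialized views):

```python
# Sketch: a "poor man's cube" as a materialized view rolled up over a
# hypothetical star schema (fact_sales / dim_date).
import psycopg2

SQL = """
CREATE MATERIALIZED VIEW sales_cube AS
SELECT d.year,
       d.month,
       f.product_id,
       sum(f.amount) AS total_amount,
       count(*)      AS sale_count
FROM   fact_sales f
JOIN   dim_date d ON d.date_id = f.date_id
GROUP  BY ROLLUP (d.year, d.month, f.product_id);

-- After each ETL load: REFRESH MATERIALIZED VIEW sales_cube;
"""

conn = psycopg2.connect("dbname=dwh user=etl host=127.0.0.1")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(SQL)
conn.close()
```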

Connecting to the database

The web servers will connect to the main database to read and write. We are currently developing the application in Django, which uses its native library for connecting. Is it recommended to keep that basic method, or should we configure pgpool?
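For reference, the "basic method" I mean is Django's standard PostgreSQL backend configured directly against the database, roughly as below (host, names, and credentials are placeholders). Pointing HOST/PORT at a pooler such as pgpool or pgbouncer would be the only change on the application side.

```python
# settings.py sketch: Django talking straight to PostgreSQL via psycopg2.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "maindb",
        "USER": "app",
        "PASSWORD": "secret",
        "HOST": "db.internal",  # would become the pooler's host/port if we add one
        "PORT": "5432",
        # On newer Django versions, persistent connections cut connect overhead:
        # "CONN_MAX_AGE": 60,
    }
}
```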

Data warehouse (ETL)

What is the recommended way to build ETL processes that read from the main database and load into the data warehouse? Any tools or methodology to follow? Does PostgreSQL offer any helpful functions/tools for building ETL processes?
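The simplest thing I can picture is a scheduled incremental pull, sketched below with psycopg2; all table and column names are invented, and a real job would also need batching, error handling, and some changelog/trigger mechanism so that deletes reach the warehouse too. (I know PostgreSQL itself offers COPY, dblink, and on newer versions postgres_fdw for moving data between databases, but I am not sure what people actually use.)

```python
# Sketch: incremental extract from the transactional DB and bulk load into
# the warehouse, keyed on an updated_at high-water mark. Placeholder names.
import psycopg2
from psycopg2.extras import execute_values

src = psycopg2.connect("dbname=maindb user=etl host=main-db.internal")
dst = psycopg2.connect("dbname=dwh user=etl host=dwh-db.internal")

with src, dst, src.cursor() as s, dst.cursor() as d:
    # What the warehouse already has.
    d.execute("SELECT coalesce(max(updated_at), '-infinity') FROM dwh_orders")
    last_load = d.fetchone()[0]

    # Extract only rows changed since the last run.
    s.execute(
        "SELECT order_id, customer_id, amount, updated_at "
        "FROM orders WHERE updated_at > %s",
        (last_load,),
    )

    # Load in one batch.
    execute_values(
        d,
        "INSERT INTO dwh_orders (order_id, customer_id, amount, updated_at) VALUES %s",
        s.fetchall(),
    )
```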

Best Answer

Infrastructure/Database Services

You should probably read this for an overview of a high-volume site that runs on AWS with EBS. They have moved to ephemeral storage, but they have had to build in some redundancy so that they can (re)store the data.

http://blog.reddit.com/2012/01/january-2012-state-of-servers.html

Data Warehouse/ETL

I've used Pentaho in the past. Not directly with Postgres, but I've found it to be a good solution for both OLAP (Mondrian) and ETL (Kettle).

http://www.pentaho.com/

edit: "Community Editions" can be found here

http://mondrian.pentaho.com/

http://kettle.pentaho.com/

Connection

These folks seem to really like pgbouncer. https://stackoverflow.com/questions/1125504/django-persistent-database-connection

I have no experience with it, though. Apparently, Disqus uses it.
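For what it's worth, the setup looks small if you want to experiment. An illustrative (not battle-tested) pgbouncer.ini might look like the following, with Django then pointed at port 6432 instead of 5432; all values here are placeholders.

```ini
[databases]
maindb = host=127.0.0.1 port=5432 dbname=maindb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
```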