PostgreSQL – High Volume Data Aggregation Techniques

postgresql, postgresql-9.1

I need to aggregate incoming data and save the resulting aggregates to a table. A sample of the data is below.

id,time
1967,2014-04-12 00:42:55+00
1967,2014-04-12 00:42:51+00
1972,2014-04-12 00:42:51+00
1972,2014-04-12 00:42:49+00
1972,2014-04-12 00:42:40+00

I need to do a simple aggregation:

SELECT id, MIN(time), MAX(time), COUNT(id) FROM xx GROUP BY id

I've tried a few things:

Directly updating a table each time a piece of data comes in. This was too slow, of course.

Saving the incoming data to a temporary table and periodically running an aggregation query similar to the one above and then saving the result to a table. This approach was better but proved too much for my disks as volume increased.
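
Roughly what that looked like (table and column names are simplified here to incoming and xx_agg):

-- Staging table that receives the raw rows.
CREATE TABLE incoming (
    id   integer,
    time timestamptz
);

-- Periodic job: fold the staged rows into the summary table, then clear the stage.
INSERT INTO xx_agg (id, min_time, max_time, row_count)
SELECT id, MIN(time), MAX(time), COUNT(*)
FROM   incoming
GROUP  BY id;

TRUNCATE incoming;   -- rows committed between the two statements would be lost;
                     -- rotating to a fresh staging table avoids that race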

My current approach is a daemon program that reads the incoming data, aggregates it in memory, and then saves the results to a table based on some time criteria. This approach works much better but still cannot scale as far as I'd like.

I am using Postgres 9.1 on a 4-disk RAID 10 array with 15k SCSI drives and 32 GB of RAM. There is no separate disk for the WAL; the WAL and everything else share the same array. I realize this is a significant performance hit, but hardware changes are problematic at this point.
This system cannot keep up with what I need to do. At the moment I need to write thousands of these aggregates per second, and the solution I choose has to scale to much more.

I have been exploring distributed database options and I like the ideas behind Cassandra. However, I am not up to speed on all the options, as I'm primarily a software developer.
Moving to a distributed database in the cloud seems like a good option for a next step. Being able to sort by the min, max, and count columns would be ideal, but I can do that in the application layer if I have to.
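
For example, if the aggregates end up in their own table (say xx_agg with min_time, max_time, and row_count columns; names are mine), I assume plain indexes would cover the sorting:

-- Indexes so ORDER BY on aggregate columns can use an index scan instead of a sort.
CREATE INDEX xx_agg_row_count_idx ON xx_agg (row_count);
CREATE INDEX xx_agg_min_time_idx  ON xx_agg (min_time);

SELECT id, min_time, max_time, row_count
FROM   xx_agg
ORDER  BY row_count DESC
LIMIT  100;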

Another important note is that the results of these aggregations are rarely needed and are only for human consumption; only a tiny fraction of them will ever be viewed. I tried generating the aggregations in real time (by querying a table that holds the raw data), but those queries ended up doing sequential scans on large tables (100m+ rows). If aggregating this much data only to have most of it go unseen is nonsensical, I'd be happy to hear of a way to generate the aggregates on request. My SQL skills are basic, so it is possible I am missing techniques that would help.
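
For instance, I assume an index on (id, time) would let an on-request aggregate for a single id avoid the sequential scan, something like:

-- Composite index: a single-id aggregate becomes an index range scan, not a seq scan.
CREATE INDEX xx_id_time_idx ON xx (id, time);

-- Generate the aggregate only when someone actually asks for this id.
SELECT id, MIN(time), MAX(time), COUNT(id)
FROM   xx
WHERE  id = 1967
GROUP  BY id;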

Here are the relevant parts of my Postgres config:

shared_buffers = 22000MB            
temp_buffers = 256MB            
work_mem = 896MB                
maintenance_work_mem = 16MB     
fsync = off             
wal_buffers = 64MB          
checkpoint_segments = 128       
checkpoint_timeout = 60min      
checkpoint_completion_target = 0.9  
random_page_cost = 2.0          
effective_cache_size = 8192MB
default_statistics_target = 200 
autovacuum_max_workers = 6      
autovacuum_naptime = 15s        
autovacuum_vacuum_threshold = 25    
autovacuum_analyze_threshold = 10   
autovacuum_vacuum_scale_factor = 0.1    
autovacuum_analyze_scale_factor = 0.05  
autovacuum_vacuum_cost_delay = 10ms 
autovacuum_vacuum_cost_limit = 1000 

So my questions:

How suitable is my Postgres config for these tasks?
What kind of limits am I looking at for this task with my current hardware?
Are there distributed databases well suited to my needs?

Best Answer

I would write the raw data to a very basic table without indexes or constraints, not even a primary key.
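
Something like this minimal sketch (raw_events and the column types are just placeholders):

-- No primary key, no indexes, no constraints: nothing extra to maintain per insert.
CREATE TABLE raw_events (
    id   integer,
    time timestamptz
);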

If you can, insert many rows at once; that's faster than single-row inserts.
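
For example, reusing the hypothetical raw_events table from above:

-- One statement, many rows: much less per-row overhead than single-row inserts.
INSERT INTO raw_events (id, time) VALUES
    (1967, '2014-04-12 00:42:55+00'),
    (1967, '2014-04-12 00:42:51+00'),
    (1972, '2014-04-12 00:42:51+00');

-- COPY is typically faster still when loading from a file or a client stream:
-- COPY raw_events (id, time) FROM STDIN WITH (FORMAT csv);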

If you can afford losing some data in a catastrophic event, make it an unlogged table; that's faster.
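
Unlogged tables are available from 9.1 on, so for example (same placeholder table, unlogged variant):

-- UNLOGGED skips WAL for this table: much cheaper writes,
-- but its contents are emptied after a crash.
CREATE UNLOGGED TABLE raw_events (
    id   integer,
    time timestamptz
);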

If you can afford losing some data in a catastrophic event and all inserts can run in a single session, make it a temporary table; that's even faster. Probably not possible here, though.

Start a new partition every time unit of your choice (daily?) and aggregate the old partition into permanent storage at some opportune moment.
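
9.1 has no declarative partitioning, so a "partition" here can simply be a separate plain table per day that you create, aggregate, and drop yourself. A sketch with placeholder names:

-- The ingest path writes to "today's" table.
CREATE UNLOGGED TABLE raw_events_2014_04_12 (
    id   integer,
    time timestamptz
);

-- Later, at a quiet moment, fold yesterday's table into permanent storage and drop it;
-- dropping a whole table is far cheaper than deleting its rows.
INSERT INTO xx_agg (id, min_time, max_time, row_count)
SELECT id, MIN(time), MAX(time), COUNT(*)
FROM   raw_events_2014_04_11
GROUP  BY id;

DROP TABLE raw_events_2014_04_11;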