PostgreSQL – Optimizing for data that does not fit in memory

Tags: aggregate, index, performance, postgresql

We have a database that contains around 235 GB of data (calculated using pg_total_relation_size for each table) on a server with around 77 GB of memory. The server is used for nothing else (aside from SSH, backups and similar administrative tasks). We are running Postgres 9.2.

I extracted one 13 GB table and tested it on a separate server with 4 GB of memory to benchmark some queries, indexes, etc. outside of our production environment. I had some questions about that work, which can be seen here: https://stackoverflow.com/questions/26468982/avoiding-external-disk-sort-for-aggregate-query.

On the separate server, the queries use index-only scans and run relatively fast (compared to the sequential scans used before the indexes were created). The same queries against the same table on the production machine use a sequential scan almost no matter what settings I try (short of setting random_page_cost lower than seq_page_cost, which is a Bad Idea(tm)).
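To compare what the planner is doing on the two machines, I have been looking at plans and buffer usage with something along these lines (daily_stats, item_id and amount are placeholder names here, not our actual schema):

EXPLAIN (ANALYZE, BUFFERS)
SELECT item_id, SUM(amount)
FROM daily_stats
GROUP BY item_id;

The shared hit/read counts, and the Heap Fetches line when an index-only scan is chosen, show how much of the work actually comes from disk.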

Our workload is to extract analytical data – we have around 10 columns of raw numbers that we run a lot of SUM() aggregate queries on, usually grouped by an item ID. The data is updated constantly, mostly with UPDATE queries to existing rows as per-day data comes in during the day (so a row for a specific day may be updated ~20 times in one day). The 13 GB table belongs to one of our biggest customers and has around 6 million rows.
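The write side of the workload looks roughly like this (same placeholder names as above, with a hypothetical stat_date column for the per-day rows):

UPDATE daily_stats
SET amount = amount + 5
WHERE item_id = 1234
  AND stat_date = CURRENT_DATE;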

The customized settings in the production config look like this:

shared_buffers = 20000MB
effective_cache_size = 56000MB
work_mem = 512MB
maintenance_work_mem = 2048MB
checkpoint_segments = 64
temp_buffers = 16MB
max_stack_depth = 4MB

wal_level = archive
archive_mode = on
archive_timeout = 6h
checkpoint_timeout = 1h
wal_writer_delay = 100
wal_keep_segments = 32
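As a sanity check, the values actually in effect on the production server (rather than just what is written in postgresql.conf) can be confirmed via pg_settings; this query is just for verification, not part of the config:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem', 'maintenance_work_mem');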

I am looking for some insight into how to create indexes (and alter queries), since the query planner obviously behaves differently in the larger setup based on the amount of data.

  • Is it better to use small indexes that only cover parts of the query (to help with lookups), or full covering indexes (for index-only scans)? A sketch of what I mean follows this list.
  • Does the above setup match my server specifications and workload?
  • Any good tips for speeding up aggregate queries (mainly SUM) over all rows, grouped by item ID? (I am aware we may need to look into column-store databases.)
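By "full covering indexes" I mean something like the following (same placeholder names as above, and assuming a single summed column for brevity). On 9.2 there is no INCLUDE clause, so the summed columns would simply go into the index as trailing key columns:

CREATE INDEX CONCURRENTLY daily_stats_item_amount_idx
  ON daily_stats (item_id, amount);

-- With the index above, this can run as an index-only scan on 9.2,
-- as long as autovacuum keeps the visibility map reasonably fresh;
-- recently updated rows still force heap fetches.
SELECT item_id, SUM(amount)
FROM daily_stats
GROUP BY item_id;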

Also, does any of the above stand out as "wrong", or should I modify it? We are running on locally attached disks, but backups are sent to NFS disks (SAN).

Best Answer

It is important to know which queries you are running; otherwise it is close to impossible to give you any real advice.

However, there are two things you should know:

But as I said: I have to see your queries to really help.