Postgresql – Deciding the best way to partition (Postgresql)

partitioningpostgispostgresql

I have been trying to research this for quite a while now and I am having a hard time trying to figure it out. I am out of ideas other then setting up an extra test database and running through all of the possibilities to see how they perform. Hopefully someone can pass on a bit of knowledge to prevent having to do that.

I have a database (Postgresql 9.2 with PostGIS) with a large amount of spatial data in it. Typically I look at it by location and date, however I do have to be able to search across all of the fields pretty regularly. The amount of data we receive in a day varies and continues to grow over time, so we cant guarantee the size of a partition on disc or the number of rows in one if we partition by date. I have read that setting a limit on the number of rows in a partition helps, because you can basically make a maximum size for each partition and allow for faster reads, but I do not have a sequential id field to partition off of either.

I have found functions that will allow me to partition by date, or possibly partition off of a calculated serial field (not sure on this one yet). My main question for now is: how should I decide which is the best way to do it? Is there an inherent benefit to one way or the other? Is there even a way to tell which will be better without implementing both and testing them?

Best Answer

It isn't even clear if you should be partitioning at all.

PostgreSQL's table partitioning is a bit primitive and comes with some limitations in enforcing referential integrity, etc. If your query pattern doesn't overwhelmingly favour filters on a particular field where you can benefit from constraint exclusion it might not help you much.

Partitioning can ease certain maintenance tasks, in particular bulk drops of data that rotates through by age. Dropping the whole of last month's data can be easier than doing a big slow DELETE with follow-up VACUUMing. This alone is rarely worth partitioning for, though.

You really do need to test for your workload and query pattern. Use the data in pg_stat_user_indexes and pg_stat_user_tables on your system as it's currently running to guide you about query patterns. I also recommend installing the pg_stat_statements extension, which will collect more info about query patterns.