PostgreSQL – Storing and querying rolling data in PostgreSQL

database-design, partitioning, postgis, postgresql, postgresql-9.3

I have a large quantity of weather model data being loaded into a PostgreSQL database. The machine has 8 cores and 16 GB of RAM, running PostgreSQL 9.3 with PostGIS 2.1. Each table will hold a different kind of weather data (temperature, dew point, wind, etc.) and will have 6–7 columns: latitude, longitude, point geometry, elevation, the date-time the model output is valid for, and 1–2 data values of interest. The data will primarily be queried by bounding box, time, and elevation.

There will be approximately 145,757,360 rows per table (data older than now is no longer relevant and will be deleted). I roughly estimate each table at about 10 GB without indexes (52 bytes of data plus 23 bytes of overhead per row). The data will be updated/inserted regularly as new model runs become available. Note: new data will include overwriting existing values in the table.
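For concreteness, one of these tables might be sketched as follows. The table name, column names, and SRID are illustrative assumptions, not taken from the actual setup:

```sql
-- Hypothetical per-variable table; names and SRID are assumptions.
CREATE TABLE temperature (
    lat        double precision NOT NULL,
    lon        double precision NOT NULL,
    geom       geometry(Point, 4326) NOT NULL,  -- PostGIS point geometry
    elevation  integer NOT NULL,                -- model level / height
    valid_time timestamptz NOT NULL,            -- time the model output is valid for
    temp_c     real NOT NULL                    -- data value of interest
);
```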

So I'm looking at these two plans:

  1. Simply index and cluster by (datetime, elevation) with an additional index for the point geometry. Run a regular cron job that deletes old rows, runs vacuum/analyze, and re-clusters.
  2. Partition by datetime, then cluster and index by elevation within each partition, with an index on geometry. Run a regular cron job that adds new tables going forward and drops old ones.
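Option 1 as a sketch (index names are assumptions, and the table layout follows the schema described above):

```sql
-- Option 1: one big table, clustered on (datetime, elevation).
CREATE INDEX temperature_time_elev_idx ON temperature (valid_time, elevation);
CREATE INDEX temperature_geom_idx ON temperature USING gist (geom);
CLUSTER temperature USING temperature_time_elev_idx;

-- Nightly cron job: delete stale rows, vacuum/analyze, re-cluster.
DELETE FROM temperature WHERE valid_time < now();
VACUUM ANALYZE temperature;
CLUSTER temperature;  -- reuses the index chosen in the earlier CLUSTER
```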

Further,

  • So, I know that dropping a table is much more efficient than deleting and vacuuming. But would I see a performance boost otherwise?
  • Are partitions appropriate when all of the tables will be evenly updated and selected on until they are deleted as irrelevant? (The documentation indicates that partitioning works best when only a few partitions would be selected at a time.)

When delivering data, will the selects be any faster with partitions than with the single clustered index? Does the answer change if multiple requests are being made at once?

Thank you. I hope I put up all the needed data. If not let me know and I'll add it.

Best Answer

All things considered, I would go with option 2. Dates will be evenly selected on, but I'm going to guess that for a given query only one or two date partitions will be involved. It's a shame you can't cluster on geolocation and partition on date, which would be ideal. Elevation tends to correlate with geolocation anyway, if the bounding boxes are sufficiently small.

Given the choices available, cleaner data operations and avoiding a daily vacuum is a good thing to have.
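Note that PostgreSQL 9.3 has no declarative partitioning, so option 2 means inheritance-based child tables with CHECK constraints. A sketch of the daily cron work, assuming one-day partitions and illustrative names:

```sql
-- Parent table stays empty; each child holds one day of data.
CREATE TABLE temperature_20140102 (
    CHECK (valid_time >= '2014-01-02' AND valid_time < '2014-01-03')
) INHERITS (temperature);

-- Per-partition indexes, clustered by elevation as in plan 2.
CREATE INDEX temperature_20140102_elev_idx
    ON temperature_20140102 (elevation);
CREATE INDEX temperature_20140102_geom_idx
    ON temperature_20140102 USING gist (geom);
CLUSTER temperature_20140102 USING temperature_20140102_elev_idx;

-- Retiring old data is a plain DROP: no DELETE, no VACUUM.
DROP TABLE temperature_20140101;
```

With `constraint_exclusion = partition` (the default), queries that filter on `valid_time` will only touch the matching children. Inserts do have to be routed to the correct child, either by a trigger on the parent or by having the loader target the child table directly.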

Delivering selects may be faster with option 1, though I suspect it will be a wash. With option 1, records with the same date and elevation sit near each other in one big clustered index. With option 2, records with the same date and elevation sit near each other in many smaller clustered indexes.
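Either way, the selects being compared have this shape; coordinates, names, and values here are placeholder assumptions:

```sql
-- Bounding-box query for one model time and elevation.
SELECT lat, lon, temp_c
FROM temperature
WHERE valid_time = '2014-01-02 12:00+00'
  AND elevation  = 500
  AND geom && ST_MakeEnvelope(-105, 39, -104, 40, 4326);  -- GiST-backed bbox filter
```

Under option 2, constraint exclusion prunes this to a single partition; under option 1, the same filter walks one range of the big clustered index.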