PostgreSQL – Continuously arriving data with fast filtering on date in PostgreSQL

postgresql, scalability

I have a huge amount of data, which arrives continuously. (About 10 million rows/day, ~3000 rows per 10 seconds during peak intervals of the day.)

I want to store it in a PostgreSQL database (I use PostGIS to process the geographical data, which is why PostgreSQL).

I usually filter this data by timestamp first, and only then select on other properties (so it's rare that I need to work with all of the data).

My first approach was to create a table with an index on the timestamp (to enable fast filtering by timestamp), and to insert the freshly arrived rows into that table every 10 seconds (so about 500-3000 rows per batch).
This approach worked for the first few days, but after that, inserting a batch took longer than the interval at which the data arrived (I wanted to insert new data every 10 seconds, but inserting it took more than 10 seconds). The reason is obviously the index maintenance. I use COPY to insert the data, as recommended here.
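
For reference, a minimal sketch of this first approach; the table, column, and file names are made up for illustration, and the geometry column assumes the PostGIS extension is installed:

```sql
-- Hypothetical table holding all incoming rows
CREATE TABLE measurements (
    ts      timestamptz NOT NULL,     -- arrival/measurement timestamp
    geom    geometry(Point, 4326),    -- PostGIS geometry (assumes CREATE EXTENSION postgis)
    payload jsonb                     -- the other properties
);

-- Index used for the "filter by timestamp first" queries
CREATE INDEX measurements_ts_idx ON measurements (ts);

-- Every 10 seconds, bulk-load the newly arrived batch (500-3000 rows)
COPY measurements (ts, geom, payload) FROM '/tmp/batch.csv' WITH (FORMAT csv);
```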

The other ideas I considered were the following, but they make the handling of the data difficult:

  • Putting the continuously arriving data into a separate table and copying it into the full table, e.g. every night. But copying will take longer every day as the full table grows, so this is not sustainable in the long run.
  • Putting each day's data into a separate table. But then it's difficult to write queries that span overlapping intervals or multiple days (I have to UNION the tables; see the sketch after this list). Still, this solution seems to be the best one yet…
  • Dropping the index before every insert and recreating it afterwards (as suggested in the documentation). But recreating the index takes more and more time as the table grows, so this is not sustainable either.
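
To illustrate the per-day-table idea, a query over an interval that crosses midnight would have to be written against hypothetical daily tables roughly like this:

```sql
-- Hypothetical per-day tables: measurements_20240101, measurements_20240102, ...
SELECT *
FROM (
    SELECT * FROM measurements_20240101
    UNION ALL
    SELECT * FROM measurements_20240102
) m
WHERE m.ts BETWEEN '2024-01-01 22:00' AND '2024-01-02 02:00';
```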

Because the data arrives continuously (the timestamp is monotonically increasing), if I didn't use a DB I could store the data in files (e.g. one file per hour) and binary-search them by timestamp. That would be much faster than storing the data in the DB (no index to maintain), and it also scales well: the time needed to insert and query a fixed amount of data stays the same, regardless of the total data size. So I think it's not impossible to create a solution which meets my requirements.

But obviously this would result in a much less usable database, and it would be much worse in other respects without PostgreSQL's features (e.g. further filtering of the data, PostGIS).

Storing the data in such files and loading them into a temporary table only for the duration of processing seems to be another possible way to solve the performance problem, but I think querying would be much slower (because the data has to be physically copied on disk first), and it seems like an ugly solution.
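
A rough sketch of that variant, reusing the hypothetical measurements layout from above (file paths and names are illustrative):

```sql
BEGIN;

-- Temporary table that disappears at COMMIT
CREATE TEMP TABLE batch (LIKE measurements INCLUDING DEFAULTS) ON COMMIT DROP;

-- Physically copy only the file(s) covering the interval of interest
COPY batch FROM '/data/2024-01-01_13.csv' WITH (FORMAT csv);

-- Full SQL / PostGIS is now available on the loaded slice
SELECT count(*) FROM batch WHERE ts >= '2024-01-01 13:15';

COMMIT;
```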

What other options do I have to handle this amount of data in a sustainable way (i.e. so that the cost of inserting and selecting a given amount of data doesn't depend on the total data size), while still keeping the queries reasonably simple?

Best Answer

As @a_horse_with_no_name suggested, you could partition the table, which seems ideal for what you are describing. Something like pg_partman would help significantly with the process:

https://github.com/keithf4/pg_partman
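
A rough sketch of how that might look; the exact call depends on the pg_partman version (this assumes a 4.x-style partman.create_parent, a parent table already declared with PARTITION BY RANGE (ts), and the extension installed into a partman schema):

```sql
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- Let pg_partman create and maintain daily partitions keyed on the ts column
SELECT partman.create_parent(
    p_parent_table := 'public.measurements',
    p_control      := 'ts',
    p_type         := 'native',
    p_interval     := 'daily'
);
```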

You could also try an unlogged table, with a job that loads the main table from it in between bursts, something like this:

http://theplateisbad.blogspot.com/2015/08/row-row-row-your-boat.html
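
A minimal sketch of that idea (names are illustrative): the 10-second loader writes into an unlogged, index-free staging table, and a periodic job drains it into the indexed main table. Note that an unlogged table is not crash-safe, so a crash can lose whatever is still sitting in the staging table.

```sql
-- Unlogged staging table: cheap to write (no WAL), and no indexes to maintain
CREATE UNLOGGED TABLE measurements_staging (LIKE measurements);

-- The 10-second loader only touches the staging table
COPY measurements_staging FROM '/tmp/batch.csv' WITH (FORMAT csv);

-- A periodic job moves the accumulated rows into the main (indexed) table
WITH moved AS (
    DELETE FROM measurements_staging RETURNING *
)
INSERT INTO measurements SELECT * FROM moved;
```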

But in the end, I think partitioning is the best way to go here: you are always writing to a small table with many sibling partitions, the table inheritance is hidden behind PostgreSQL's partitioning machinery, and entire partitions can be eliminated via constraint exclusion based on the timestamp ranges in your queries.
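
For completeness, a minimal sketch of such a setup, assuming PostgreSQL 11+ declarative partitioning (where an index created on the parent is cascaded to every partition) and the hypothetical measurements layout from the question; with older versions or with pg_partman the mechanics differ, but the idea is the same:

```sql
-- Parent table, partitioned by timestamp range
CREATE TABLE measurements (
    ts      timestamptz NOT NULL,
    geom    geometry(Point, 4326),
    payload jsonb
) PARTITION BY RANGE (ts);

-- One partition per day; only the current (small) partition receives inserts
CREATE TABLE measurements_2024_01_01 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
CREATE TABLE measurements_2024_01_02 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-02') TO ('2024-01-03');

-- An index on the parent is created on every partition as well
CREATE INDEX ON measurements (ts);

-- A query that filters on the timestamp only touches the matching partition(s)
SELECT * FROM measurements
WHERE ts BETWEEN '2024-01-01 10:00' AND '2024-01-01 12:00';
```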