PostgreSQL – Bulk loading to a partitioned table

csv, partitioning, postgresql, postgresql-9.3

I am designing a database application which will be using PostgreSQL 9.3. My question is regarding bulk loading a CSV to a partitioned table. I have done a lot of research, but I'm still unsure of the best way to proceed.

In our application, we download data in CSV format every day from web-enabled devices and load these files into a single database table. Each CSV contains a record for every minute of the day, and we have a few hundred devices, so we end up loading about a million records each day. This will increase, though, as we are adding more devices every day.

We have found that performance suffers as this table grows, so I would like to redesign it as a partitioned master table that uses range partitioning on the date field, so that each child table contains the records for a single month. This seems like the best balance for us between table size and number of partitions.
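Since PostgreSQL 9.3 has no declarative partitioning, my understanding is that this would be built with table inheritance plus CHECK constraints, roughly like the sketch below (the table and column names are just placeholders I made up):

    -- Rough sketch only; device_data, reading_ts, etc. are placeholder names.
    -- On 9.3, the "partitioned" table is a parent with CHECK-constrained children.
    CREATE TABLE device_data (
        device_id  integer   NOT NULL,
        reading_ts timestamp NOT NULL,
        value      numeric
    );

    -- One child table per month; the CHECK constraint lets constraint
    -- exclusion skip partitions that cannot match a query's date range.
    CREATE TABLE device_data_2015_01 (
        CHECK (reading_ts >= DATE '2015-01-01' AND reading_ts < DATE '2015-02-01')
    ) INHERITS (device_data);

    CREATE INDEX ON device_data_2015_01 (device_id, reading_ts);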

My understanding is that the most efficient way for us to load data would be to insert the records into a staging table, then alter that table so it becomes a child of our master table. In our case, though, this won't be possible, because we need to update the table every day with new data; we cannot wait until we have a full month's worth and only then attach that month's table as a child of the master table.
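For reference, my understanding is that attaching an already-loaded table as a child looks roughly like this (names are placeholders again; the CHECK constraint is added first so the planner can use it for constraint exclusion):

    -- Sketch of attaching a pre-loaded month table as a child partition.
    ALTER TABLE device_data_2015_01
        ADD CONSTRAINT device_data_2015_01_range
        CHECK (reading_ts >= DATE '2015-01-01' AND reading_ts < DATE '2015-02-01');

    ALTER TABLE device_data_2015_01 INHERIT device_data;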

So far, I've come up with three possible solutions:

  1. Use bulk loading to send the CSV to the master table. A trigger on the master table would route each row to the correct child partition, as described here; a rough sketch of such a trigger appears after this list. I'm very concerned about the performance of this, as my understanding is that the trigger would fire for every record that gets inserted.
    According to that link, using rules instead of triggers would give better performance, as they would apply to the entire query instead of every record. Unfortunately, it also says rules are not supported with bulk inserts that use the COPY command.

  2. Use bulk loading to send the CSV to the correct child table. Since we generally upload one CSV at a time which contains one day's data for one device, we can add some logic to our ETL code that determines the correct child table for that day and sends it there. This is not ideal because I would like the partitioned table design to abstract this logic away and take care of it in the background.

  3. Create another table which contains only the current month's data. Update this table each day, and when the month is over, alter this table so it becomes a child of the master table with all the other past months' data. Then create a new current-month table and start updating it instead.

    I like this approach because it minimizes updates to the master table. We do have a use case where we have to re-load data from previous months, so we will probably still have to use one of the previous methods for these cases.
    The disadvantage with this approach is that our master table will not contain the data for the current month, as that will be located in a different table.
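For clarity, the routing trigger I have in mind for option 1 would look roughly like the sketch below (names are placeholders; it derives the child table name from each row's timestamp and runs once per inserted row, which is exactly my performance concern):

    -- Sketch of a routing trigger for option 1; table names are placeholders.
    CREATE OR REPLACE FUNCTION device_data_insert_trigger()
    RETURNS trigger AS $$
    BEGIN
        -- Build the child table name from the row's month and insert into it.
        EXECUTE format(
            'INSERT INTO %I SELECT ($1).*',
            'device_data_' || to_char(NEW.reading_ts, 'YYYY_MM')
        ) USING NEW;
        RETURN NULL;  -- the row has been redirected; skip the parent table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER device_data_insert
        BEFORE INSERT ON device_data
        FOR EACH ROW EXECUTE PROCEDURE device_data_insert_trigger();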

I'd appreciate anyone's input on this problem and the best solution, as I'm still somewhat new at this. Please let me know if I've made any incorrect assumptions or I'm just flat-out wrong about something.

Best Answer

Option 2 would be the fastest: you are loading directly into the final child table, append-only. Option 3 would be similar, except that you are also updating data.
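As a rough illustration (table, column, and file names are made up), a direct load for option 2 is just a COPY into whichever child table your ETL code picks for that day's file:

    -- Illustrative only: COPY straight into the child table chosen by the
    -- ETL code for the file's date. The path and names are placeholders.
    COPY device_data_2015_01 (device_id, reading_ts, value)
    FROM '/path/to/device_2015-01-15.csv'
    WITH (FORMAT csv, HEADER true);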

Option #4: Create an UNLOGGED staging table and COPY the CSV into it. After loading, use plain SQL or PL/pgSQL to INSERT INTO the child table with a SELECT from the staging table. That allows fast bulk loading, with optional transformation steps after the load. If the staging table has few indexes (ideally, add any indexes only after the bulk load), the whole ETL process can run quite quickly. Using PL/pgSQL to load your child partition from the staging table also gives you error-handling capabilities.
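A minimal sketch of that flow, with made-up names and path (the exception block is only an example of the kind of error handling PL/pgSQL makes possible):

    -- Sketch of option #4: UNLOGGED staging table, COPY, then load the
    -- child with INSERT ... SELECT. All names and the path are placeholders.
    CREATE UNLOGGED TABLE device_data_staging (LIKE device_data);

    COPY device_data_staging
    FROM '/path/to/device_2015-01-15.csv'
    WITH (FORMAT csv, HEADER true);

    DO $$
    BEGIN
        -- Optional transformation or validation steps could run here.
        INSERT INTO device_data_2015_01
        SELECT * FROM device_data_staging;
    EXCEPTION
        WHEN check_violation THEN
            -- Rows outside the child's date range abort the insert; real
            -- code might log them or reroute them to the right partition.
            RAISE NOTICE 'load aborted: staging data falls outside the target month';
    END;
    $$;

    TRUNCATE device_data_staging;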