PostgreSQL – Why do psql queries take longer when inserting many rows? Why is it non-linear?

performance, postgresql

So if I insert a single row into a Postgres database, it takes 18 ms. If I instead insert many rows with a single statement like this:

INSERT INTO contacts (numbers)
SELECT DISTINCT array[
        (random() * 99999999)::integer,
        (random() * 99999999)::integer
    ]
  FROM generate_series(1,4000000) AS x(id);

When I vary the number of rows inserted, the time grows non-linearly. Here is the data:

- 1 record – 18 ms
- 20k records – 36 seconds
- 50k records – 151 seconds
- 100k records – 750 seconds
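
For reference, one way to reproduce numbers like these is psql's built-in \timing switch, using the same table and query as above and varying the row count via generate_series:

\timing on

INSERT INTO contacts (numbers)
SELECT DISTINCT array[
        (random() * 99999999)::integer,
        (random() * 99999999)::integer
    ]
  FROM generate_series(1, 50000) AS x(id);
-- psql prints "Time: ... ms" after the statement completes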

Why does the time grow so much faster than linearly? I need 10 million records in my database for load testing, and it seems faster to insert 50k rows and then insert another 50k, since 151 + 151 < 750.
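
A sketch of that chunked approach, assuming the same contacts table (the batch size and batch count are illustrative, DISTINCT is dropped since random test data does not need it, and COMMIT inside a DO block requires PostgreSQL 11+ run outside an explicit transaction):

DO $$
BEGIN
  FOR batch IN 1..200 LOOP  -- 200 batches of 50k rows = 10 million rows
    INSERT INTO contacts (numbers)
    SELECT array[
            (random() * 99999999)::integer,
            (random() * 99999999)::integer
        ]
      FROM generate_series(1, 50000);
    COMMIT;  -- give each batch its own transaction so earlier batches stay committed
  END LOOP;
END
$$;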

Any insight on this topic would be appreciated. My assumption is that Postgres keeps data around for rollback in case the query fails or is cancelled by the user, since Postgres does not want to "half insert" the total request.

Best Answer

Leaving aside the fact that the DISTINCT is causing some weird behavior, there are two main reasons why insert times get longer as bulk loads get larger:

  1. B-tree indexes get less efficient to update as they get larger and have more tree levels. So indexes take longer to insert the millionth value than they did the 10th (a common mitigation is sketched after this list).
  2. At certain sizes, you exceed thresholds that trigger extra IO on the system, resulting in lag while that IO takes place. These thresholds, which interact in complex ways, include:
    • the size of the WAL, causing log rotation
    • the size of the RAID cache, dropping to disk speeds
    • the size of Postgres' dedicated cache, causing flushing to the FS
    • the size of the FS cache's dirty block flushing threshold
    • the size of the entire FS cache, causing emergency flushing
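
As a practical aside for both points, bulk loads are usually faster if secondary indexes are dropped for the duration of the load and the WAL/checkpoint machinery is given more headroom. A rough sketch, assuming a hypothetical index contacts_numbers_idx and settings that would need tuning to the actual hardware:

-- drop secondary indexes before the load; rebuilding one large B-tree afterwards
-- is much cheaper than maintaining it row by row during the load
DROP INDEX IF EXISTS contacts_numbers_idx;

-- give checkpoints more room so the load is interrupted less often by WAL-driven flushing
-- (illustrative values; both settings take effect after a config reload)
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
SELECT pg_reload_conf();

-- ... run the batched INSERTs here ...

CREATE INDEX contacts_numbers_idx ON contacts (numbers);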