PostgreSQL – Why do psql queries take longer when inserting many rows? Why is it non-linear?

performance, postgresql

So if I insert a single row into a Postgres database, it takes 18 ms. If I instead insert many rows with a single statement like this:

INSERT INTO contacts (numbers)
SELECT DISTINCT array[
        (random() * 99999999)::integer,
        (random() * 99999999)::integer
    ]
  FROM generate_series(1,4000000) AS x(id);

When I vary the number of rows inserted, the time grows non-linearly. Here is the data:

- 1 record – 18 ms
- 20k records – 36 seconds
- 50k records – 151 seconds
- 100k records – 750 seconds
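
For reference, one way to reproduce numbers like these is psql's built-in \timing switch, using the same table and query as above and varying the row count via generate_series:

\timing on

INSERT INTO contacts (numbers)
SELECT DISTINCT array[
        (random() * 99999999)::integer,
        (random() * 99999999)::integer
    ]
  FROM generate_series(1, 50000) AS x(id);
-- psql prints "Time: ... ms" after the statement completes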

Why does the time grow so much faster than linearly? I need 10 million records in my database for load testing, and it seems faster to insert 50k rows and then insert another 50k, since 151 + 151 < 750.
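
A sketch of that chunked approach, assuming the same contacts table (the batch size and batch count are illustrative, DISTINCT is dropped since random test data does not need it, and COMMIT inside a DO block requires PostgreSQL 11+ run outside an explicit transaction):

DO $$
BEGIN
  FOR batch IN 1..200 LOOP  -- 200 batches of 50k rows = 10 million rows
    INSERT INTO contacts (numbers)
    SELECT array[
            (random() * 99999999)::integer,
            (random() * 99999999)::integer
        ]
      FROM generate_series(1, 50000);
    COMMIT;  -- give each batch its own transaction so earlier batches stay committed
  END LOOP;
END
$$;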

Any insight on this topic would be appreciated. My assumption is that Postgres keeps data around for rollback in case the query fails or is cancelled by the user, since Postgres does not want to "half insert" the total request.

Best Answer

Leaving aside the fact that the DISTINCT is causing some weird behavior, there are two main reasons why insert times get longer as bulk loads get larger:

  1. B-tree indexes get less efficient to update as they get larger and have more tree levels. So indexes take longer to insert the millionth value than they did the 10th (a common mitigation is sketched after this list).
  2. At certain sizes, you exceed thresholds that trigger extra IO on the system, resulting in lag while that IO takes place. These thresholds, which interact in complex ways, include:
    • the size of the WAL, causing log rotation
    • the size of the RAID cache, dropping to disk speeds
    • the size of Postgres' dedicated cache, causing flushing to the FS
    • the size of the FS cache's dirty block flushing threshold
    • the size of the entire FS cache, causing emergency flushing
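
As a practical aside for both points, bulk loads are usually faster if secondary indexes are dropped for the duration of the load and the WAL/checkpoint machinery is given more headroom. A rough sketch, assuming a hypothetical index contacts_numbers_idx and settings that would need tuning to the actual hardware:

-- drop secondary indexes before the load; rebuilding one large B-tree afterwards
-- is much cheaper than maintaining it row by row during the load
DROP INDEX IF EXISTS contacts_numbers_idx;

-- give checkpoints more room so the load is interrupted less often by WAL-driven flushing
-- (illustrative values; both settings take effect after a config reload)
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
SELECT pg_reload_conf();

-- ... run the batched INSERTs here ...

CREATE INDEX contacts_numbers_idx ON contacts (numbers);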