Large bulk insert performance difference PostgreSQL vs MariaDB (InnoDB)

bulk-insert, performance, postgresql, postgresql-performance

I've got the very same single-table definition in both PostgreSQL 11.4 and MariaDB 10.4.6 (InnoDB), and I'm seeing a significant difference in INSERT performance, to PostgreSQL's disadvantage, that I cannot explain.

My simplified table definition (in production I have a few indexes):

CREATE TABLE my_table_def
(
    user_id bigint NOT NULL,
    serial bigint NOT NULL,
    object_type smallint NOT NULL,
    other_type smallint NOT NULL,
    other_id bigint NOT NULL,
    timestamp_ns bigint NOT NULL,
    data bytea, -- typically ~ 50-200 bytes, sometimes NULL
    CONSTRAINT my_table_def_pkey PRIMARY KEY (user_id, serial)
    -- ^ in MariaDB this is a plain PRIMARY KEY, not a named constraint
);

My inserts are in serial order, but with a huge spread across user_id values, i.e. out-of-order writes with respect to the PRIMARY KEY here. (My SELECTs are always for a single user_id value, but that's unrelated to my question here.)

That out-of-order inserting should make a huge difference between PostgreSQL and MariaDB, in favor of PostgreSQL, because the two handle data clustering (ordering on disk) completely differently: InnoDB uses the PRIMARY KEY as the clustered index for me and maintains that ordering while inserts are happening, whereas PostgreSQL does not cluster at insert time but simply appends new rows at the end of the table and reorders the data only when manually triggered (e.g. CLUSTER or pg_repack).

What I'm observing in write/insert performance, however, is the complete opposite of what I expected: much lower insert performance on PostgreSQL. Moreover, I am unable to identify a bottleneck.

I'm inserting in batches of 50k rows in this pattern, using a single connection:

  • BEGIN transaction
  • 50x INSERT INTO table (a, b, c) VALUES (… 1000 rows …)

    (also tried 5x 10,000 rows)

  • COMMIT

This is done in Python 3.6 with psycopg2 2.8.3 and SQLAlchemy 1.3.4, with a line like this:

session.execute(schema_declarative_class.__table__.insert(),
                params=my_bulk_save_list)
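
For completeness, here is a minimal sketch of the full batching loop around that line (the engine/session setup, DSN and chunk size are illustrative; schema_declarative_class and my_bulk_save_list are the same names as in the snippet above):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder DSN
session = sessionmaker(bind=engine)()

CHUNK = 1000  # rows per INSERT statement; 50 chunks per 50k-row transaction

try:
    for i in range(0, len(my_bulk_save_list), CHUNK):
        # one INSERT per chunk, all inside the same transaction
        session.execute(schema_declarative_class.__table__.insert(),
                        params=my_bulk_save_list[i:i + CHUNK])
    session.commit()  # COMMIT the whole 50k-row batch at once
except Exception:
    session.rollback()
    raise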

Both PostgreSQL and MariaDB are 'tuned', and with MariaDB I see I/O bottlenecks appear once the table grows as large as the buffer pool, which is expected due to the clustering of data while inserting. The write amplification observed with InnoDB is quite tremendous and I was unable to tweak it adequately, hence my experiment with PostgreSQL.

For PostgreSQL I don't see a clear bottleneck. I/O is very low, as expected: only single-digit megabytes per second are written to my SSD, at ~ 100 IOPS (peanuts!).
CPU-wise I see 40% of one core used on the PostgreSQL server, and ~ 5% of one core used by the PostgreSQL client application.
Memory-wise I see about 1.2 times the size of shared_buffers in use, which looks fine to me.

Settings for PostgreSQL, passed on the command line to be sure they override anything else:

shared_buffers=1GB
effective_cache_size=2GB
maintenance_work_mem=512MB
synchronous_commit=off
fsync=off
full_page_writes=off
checkpoint_completion_target=0.9
wal_buffers=16MB
default_statistics_target=500
random_page_cost=1.1
effective_io_concurrency=200
work_mem=26214kB
min_wal_size=2GB
max_wal_size=4GB
max_worker_processes=4
max_parallel_workers_per_gather=2
max_parallel_workers=4
autovacuum=off

PostgreSQL: ~ 8k inserts/s (before having run CLUSTER)
MariaDB: ~ 20k inserts/s (already clustered on-line)

I was hoping to get near 100k inserts/s, simply because this workload is append-only and I can afford to have only a single index at insert time.

What I have tried:

  • No PRIMARY KEY, leaving no index on the table. This increased performance slightly, to 11k/s; however, it is unacceptable for read performance on the table being appended to.
  • Different batch sizes, with several numbers and sizes of in-transaction INSERT sub-batches. It seems that 5x10k per transaction pretty much tops it out already.
  • Using psycopg2 batch mode (a sketch of what I mean follows after this list). It helps a bit: just over 10k inserts/s. Looking at what happens, it seems to mainly optimize how inbound SQL is parsed at the server; that this has an effect at all makes me wonder how slow the parser is compared to MariaDB's…
  • Prepared statements. Minimal difference.
  • Using PostgreSQL 12 (beta at time of writing). No difference.
  • Leaving out the bytea column. The difference is noticeable but seems proportional to the total row length: a ~ 20-30% increase.
  • Using int instead of bigint: same as above; the row is smaller, so it is only a little bit faster.
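
For reference, this is roughly what I mean by psycopg2 batch mode, here using the execute_values fast-execution helper directly instead of going through SQLAlchemy (connection string and example rows are just placeholders):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN

rows = [
    # (user_id, serial, object_type, other_type, other_id, timestamp_ns, data)
    (1, 1, 0, 0, 42, 1560000000000000000, b"example payload"),
    (2, 1, 0, 0, 43, 1560000000000000001, None),
]

with conn, conn.cursor() as cur:
    # execute_values renders one multi-row INSERT per page_size rows
    execute_values(
        cur,
        "INSERT INTO my_table_def "
        "(user_id, serial, object_type, other_type, other_id, timestamp_ns, data) "
        "VALUES %s",
        rows,
        page_size=1000,
    )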

What I have not tried yet:

  • Using multiple connections simultaneously: not an option anyway; I have a strong requirement to insert batches serially, and the database should never have gaps. I.e. per transaction I can only append contiguously to the data already present.
  • Using different client libraries/tooling. psycopg2 should be quite fast, and it does not look like the bottleneck is there.
  • Bulk insert using COPY. (Will do soon.)

Ruled out:

  • Network bandwidth: I observe the same behaviour with a local Docker container.

Another interesting thing I have observed: I was playing with the foreign data wrapper (postgres_fdw) to shard (partition) my table across several machines. The postgres_fdw instance acting as a proxy, not storing any data, shows the very same CPU usage as a single backend node: 40% of a single core, while the backend servers each show 1/N of that amount of CPU cycles, where N is the number of servers. Hmm.

I also can't seem to find benchmarks of pure INSERT performance for a setup similar to mine; most seem to be about SELECTs, not INSERTs.

The documentation lists quite a few options around WAL flushing, but it seems my WAL never gets anywhere near full. Data is being written at less than a megabyte per second with under 100 IOPS, a load similar to near-idle desktop usage…

My conclusion so far: PostgreSQL seems quite inefficient, in terms of CPU cycles taken, at decoding the messages received from the client, but even that does not seem to be the bottleneck. There are no other clear bottlenecks; it simply looks like I'm being 'throttled'.

What else can I do to identify the bottleneck here? Am I observing a bug, or am I doing something wrong?

Best Answer

Use \copy (or the server-side COPY) directly against the data node.

Preparing the batch as CSV and loading the data from it, I'm seeing 160k+ inserts/s with \copy.
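
In Python the same approach looks roughly like this, using psycopg2's copy_expert with an in-memory CSV buffer instead of psql's \copy (DSN, column order and example rows are illustrative):

import csv
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN

rows = [
    # (user_id, serial, object_type, other_type, other_id, timestamp_ns, data)
    # bytea is given in its hex text form; an empty field is read as NULL
    (1, 1, 0, 0, 42, 1560000000000000000, r"\x0102"),
    (2, 1, 0, 0, 43, 1560000000000000001, ""),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)

with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY my_table_def "
        "(user_id, serial, object_type, other_type, other_id, timestamp_ns, data) "
        "FROM STDIN WITH (FORMAT csv)",
        buf,
    )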

I don't see how loading from ASCII-encoded CSV files is more efficient, from the server's point of view, than sending prepared statements, but it works. Something seems to perform very poorly in libpq when doing multi-value INSERTs.

However, a big dealbreaker remains with \copy if you can only insert through the coordinator node of an FDW cluster. I'm using postgres_fdw and would like to insert against the node serving the table, which is sharded into smaller foreign tables. What I'm observing is that the CPU usage of the coordinator drops to near zero (nice!), but the CPU usage of the data nodes rises as before and everything is clearly bottlenecked again. It looks like postgres_fdw splits the \copy bulk into multi-value INSERTs to the data nodes, showing the same poor performance again, now at the individual data nodes.

However, that FDW performance issue seems to be a separate one to solve, so I'm marking this question as solved.