Using PostgreSQL 9.3 on Windows 7 x64, Core i7, SSD.
Assume a PostgreSQL table with eight Int32 columns and the following C# code to fill the table.
using (var connection = new NpgsqlConnection("User Id=postgres;Password=**;host=localhost;database=**;"))
using (var command = new NpgsqlCommand())
{
    connection.Open();
    command.Connection = connection;
    command.CommandText = "COPY \"MyTable\" FROM STDIN;";

    var copyInSerializer = new NpgsqlCopySerializer(connection);
    var copyIn = new NpgsqlCopyIn(command, connection, copyInSerializer.ToStream);
    copyIn.Start();

    for (int i = start; i < start + 1000000; i++)
    {
        copyInSerializer.AddInt32(i);
        // add 7 other Int32 values
        copyInSerializer.EndRow();
    }

    copyInSerializer.Flush();
    copyIn.End();
    connection.Close();
}
On my test hardware this took ~10 seconds to execute, with measured disk usage of ~20 MB/s.
Neither CPU nor RAM hit their limits.
The SSD in question supports ~60 MB/s when writing random 4K blocks, so I wondered why the operation did not run faster. I created a second, identical table in the same database on the same disk and ran the code sample above in parallel (two tasks), one per table.
Both still took ~10 seconds, and measured disk usage was ~40 MB/s.
Why does PostgreSQL not devote all available resources to the only operation in progress?
Is there a way to force PostgreSQL to use the full available disk bandwidth?
Best Answer
ADO.NET (used in the code sample) communicates with the server over a network connection, even when a local Postgres instance is used. Using COPY ... FROM STDIN therefore incurs quite a bit of overhead. Before Postgres can insert the bulk data, each row needs to be serialized by the client, pushed through the (loopback) network stack, and received and parsed by the server.
This basically explains/justifies the described behaviour.
If the C# client and Postgres run on the same computer, you can speed up bulk insertion by writing a CSV file to a local SSD (or, better, a RAM drive) and executing the SQL statement COPY "MyTable" FROM 'x:\test.csv' WITH (FORMAT csv); so the server reads the file directly.
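As a rough sketch, the file-based variant could look like this. This is an illustration, not the answer's exact code: it assumes Npgsql is referenced, the client runs on the same machine as the server, the table "MyTable" has eight integer columns, and the file path x:\test.csv is readable by the server process.

```csharp
using System.IO;
using Npgsql;

class BulkLoadViaCsv
{
    static void Main()
    {
        const string path = @"x:\test.csv";

        // 1. Write the rows as CSV to a fast local drive (or a RAM drive).
        using (var writer = new StreamWriter(path))
        {
            for (int i = 0; i < 1000000; i++)
            {
                // eight integer columns per row (dummy values for illustration)
                writer.WriteLine(string.Format("{0},{0},{0},{0},{0},{0},{0},{0}", i));
            }
        }

        // 2. Have the server read the file directly, bypassing the
        //    client/server protocol entirely.
        using (var connection = new NpgsqlConnection("User Id=postgres;Password=**;host=localhost;database=**;"))
        using (var command = new NpgsqlCommand())
        {
            connection.Open();
            command.Connection = connection;
            command.CommandText = "COPY \"MyTable\" FROM 'x:\\test.csv' WITH (FORMAT csv);";
            command.ExecuteNonQuery();
        }
    }
}
```

Note that server-side COPY ... FROM 'file' reads the file with the permissions of the server process and requires superuser rights on PostgreSQL 9.3 (the example connects as postgres), which is why the STDIN variant is the usual choice for non-superuser clients.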