Postgresql – Can't get a Dell PE T420 (Perc H710) to perform better than a MacMini with PostgreSQL

performance, postgresql, raid, unix

I've been trying to solve a performance issue with PostgreSQL for months.

SYSTEM CONFIGURATION

Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID controller configured in this way:

  • VD0: two 15k SAS disks (ext4, OS partition, WAL partition, RAID1)
  • VD1: ten 10k SAS disks (XFS, Postgres data partition, RAID5)

This system has the following configuration:

  • Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64)
  • 128GB RAM (DDR3, 8x16GB @ 1600 MHz)
  • two Intel Xeon E5-2640 v2 @ 2 GHz
  • Dell Perc H710 with 512MB RAM (Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "disabled"):
    • VD0 (OS and WAL partition): two 15k SAS disks (ext4, RAID1)
    • VD1 (Postgres data partition): ten 10k SAS disks (XFS, RAID5)
  • PostgreSQL 9.4 (updated to the latest available version)
  • moved pg_stat_tmp to RAM disk
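
The H710 cache policy quoted above can be verified from the OS with LSI's MegaCli utility; the exact binary name and path depend on the install, so treat this as a sketch:

megacli -LDInfo -Lall -aALL | grep -i 'cache policy'   # prints the Current/Default/Disk Cache Policy lines for each virtual drive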

My personal low cost and low profile development machine is a MacMini configured in this way:

  • OS X Server 10.7.5
  • 8GB RAM (DDR3, 2x4GB @ 1333 MHz)
  • one Intel i7 @ 2.2 GHz
  • two internal 500GB 7.2k SAS HDDs (non-RAID) for the OS partition
  • external Promise Pegasus R1 connected with Thunderbolt v1 (512MB RAM, four 1TB 7.2k SAS HDD 32MB cache, RAID5, Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "enabled", NCQ: "enabled")
  • PostgreSQL 9.0.13 (the original built-in shipped with OS X Server)
  • moved pg_stat_tmp to RAM disk
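
On both machines pg_stat_tmp lives on a RAM disk; on the Linux box this amounts to something like the following (the mount point and size here are my assumptions, not the exact values used):

mount -t tmpfs -o size=64M tmpfs /var/run/pg_stat_tmp   # small tmpfs for the statistics temp files
chown postgres:postgres /var/run/pg_stat_tmp

# postgresql.conf
stats_temp_directory = '/var/run/pg_stat_tmp'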

So far I've made a lot of tuning adjustments to both machines, including the kernel settings recommended in the official Postgres documentation.

APPLICATION

The deployment machine runs a web platform that instructs Postgres to run big transactions over billions of records. The platform is designed for a single user, because system resources have to be dedicated as much as possible to one job due to the data size (I don't like to call it big data, because big data is in the order of tens of billions of records).

ISSUES

I've found the deployment machine to be a lot slower than the development machine. This is paradoxical, because the two machines differ greatly in many respects. I've run many queries to investigate this strange behaviour and made a lot of tuning adjustments.

During the last two months I've prepared and executed two types of query sets:

  • A: these sets make use of SELECT ... INTO, CREATE INDEX, CLUSTER and VACUUM ANALYZE.
  • B: these sets come from transactions generated by our application and make use of SELECT over the tables created by set A.

Both A and B were always slower on the T420. The only operation that was faster was VACUUM ANALYZE.
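
For a rough idea of the shape of a set-A step, it boils down to something like the following (table, column and index names here are made up, not our real schema):

SELECT a.id, b.value
  INTO results_tmp
  FROM source_a a
  JOIN source_b b ON b.a_id = a.id;
CREATE INDEX results_tmp_id_idx ON results_tmp (id);
CLUSTER results_tmp USING results_tmp_id_idx;
VACUUM ANALYZE results_tmp;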

RESULTS

A type set:

  • T420: went from 311 seconds (default postgresql.conf) to 195 seconds after tuning adjustments to the RAID controller, the kernel and postgresql.conf;
  • MacMini: 40 seconds.

B type set:

  • T420: 141 seconds;
  • MacMini: 101 seconds.

I should mention that we also adjusted the BIOS on the T420, setting all possible parameters to "performance" and disabling low-power profiles. This lowered the execution time of a type A set from 240 seconds to 211 seconds.

We have also upgraded all firmware and the BIOS to the latest available versions.

Here are two benchmarks generated using pg_test_fsync:
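
Both runs use 60 seconds per test; an invocation along the following lines reproduces that setting (the binary path and the target file location are assumptions):

/usr/lib/postgresql/9.4/bin/pg_test_fsync -s 60 -f /path/to/wal/partition/pg_test_fsync.out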

T420 pg_test_fsync

60 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                   23358.758 ops/sec      43 usecs/op
        fdatasync                       21417.018 ops/sec      47 usecs/op
        fsync                           21112.662 ops/sec      47 usecs/op
        fsync_writethrough                            n/a
        open_sync                       23082.764 ops/sec      43 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                   11737.746 ops/sec      85 usecs/op
        fdatasync                       19222.074 ops/sec      52 usecs/op
        fsync                           18608.405 ops/sec      54 usecs/op
        fsync_writethrough                            n/a
        open_sync                       11510.074 ops/sec      87 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write       21484.546 ops/sec      47 usecs/op
         2 *  8kB open_sync writes      11478.119 ops/sec      87 usecs/op
         4 *  4kB open_sync writes       5885.149 ops/sec     170 usecs/op
         8 *  2kB open_sync writes       3027.676 ops/sec     330 usecs/op
        16 *  1kB open_sync writes       1512.922 ops/sec     661 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close             17946.690 ops/sec      56 usecs/op
        write, close, fsync             17976.202 ops/sec      56 usecs/op

Non-Sync'ed 8kB writes:
        write                           343202.937 ops/sec       3 usecs/op

MacMini pg_test_fsync

60 seconds per test
Direct I/O is not supported on this platform.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      3780.341 ops/sec     265 usecs/op
        fdatasync                          3117.094 ops/sec     321 usecs/op
        fsync                              3156.298 ops/sec     317 usecs/op
        fsync_writethrough                  110.300 ops/sec    9066 usecs/op
        open_sync                          3077.932 ops/sec     325 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      1522.400 ops/sec     657 usecs/op
        fdatasync                          2700.055 ops/sec     370 usecs/op
        fsync                              2670.652 ops/sec     374 usecs/op
        fsync_writethrough                   98.462 ops/sec   10156 usecs/op
        open_sync                          1532.235 ops/sec     653 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write          2634.754 ops/sec     380 usecs/op
         2 *  8kB open_sync writes         1547.801 ops/sec     646 usecs/op
         4 *  4kB open_sync writes          801.542 ops/sec    1248 usecs/op
         8 *  2kB open_sync writes          405.515 ops/sec    2466 usecs/op
        16 *  1kB open_sync writes          204.095 ops/sec    4900 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close                2747.345 ops/sec     364 usecs/op
        write, close, fsync                3070.877 ops/sec     326 usecs/op

Non-Sync'ed 8kB writes:
        write                              3275.716 ops/sec     305 usecs/op

This confirms the raw I/O capabilities of the T420's hardware, but it doesn't explain why the MacMini is so much faster.

Any ideas?

UPDATE 1

The B type sets are transactions, so I can't post EXPLAIN ANALYZE results for them as a whole. I've extracted two queries from a single transaction and executed both on each system. Here are the results:
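
Plans like the ones linked below can be produced from psql along these lines (the actual query text can't be shared, so the statement is only a placeholder):

\timing on
EXPLAIN ANALYZE <query B_1 here>;   -- output pasted into explain.depesz.com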

T420

Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk

UPDATE 2

I compiled the original Postgres 9.4.1 source using different combinations of parameters for gcc-4.9.1 and Postgres. I followed this article, but I was unable to test the -flto option because of several errors returned by make. After two days of testing, the A set went from 195 to 189 seconds on the T420 (versus 40 seconds on the MacMini), and the B set went from 141 to 129 seconds (versus 101 seconds on the MacMini). I used the following compile options:

./configure CFLAGS="-O3 -fno-inline-functions -march=native" --with-openssl --with-libxml --with-libxslt --with-wal-blocksize=64 --with-blocksize=32 --with-wal-segsize=64 --with-segsize=1

I've also tried disabling Hyper-Threading with echo 0 > /sys/devices/system/cpu/cpuN/online, where cpuN is the N-th logical CPU, but nothing changed for the B set queries. We have two CPUs with 8 cores each: 16 physical cores plus 16 Hyper-Threading logical cores.
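
A loop along these lines takes the sibling threads offline; the assumption that CPUs 16-31 are the Hyper-Threading siblings should be checked against /sys/devices/system/cpu/cpu*/topology/thread_siblings_list before running it:

for n in $(seq 16 31); do
    echo 0 > /sys/devices/system/cpu/cpu$n/online   # take logical CPU n offline
done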

UPDATE 3

Here are the query plans as before, plus the T420 plans obtained with the same configuration as the MacMini (but a different Postgres version); this way the plans are identical, and only the performance differs.

T420 with MacMini postgresql.conf

Query B_1 [51280.208 ms + 0.699 ms] http://explain.depesz.com/s/wlb

Query B_2 [177278.205 ms + 0.428 ms] http://explain.depesz.com/s/rzr

T420 with optimal postgresql.conf

Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk


Below are the postgresql.conf settings of the T420 and the MacMini used for all the tests.

T420 postgresql.conf

Normal operations

autovacuum = on
maintenance_work_mem = 512MB
work_mem = 512MB
wal_buffers = 64MB
effective_cache_size = 64GB # this helps A LOT in disk write speed when creating indexes
shared_buffers = 32GB
checkpoint_segments = 2000
checkpoint_completion_target = 1.0
effective_io_concurrency = 0 # 1 doesn’t make any substantial difference
max_connections = 10 # 20 doesn’t make any difference

Data loading (same as above with the following changes):

autovacuum = off
maintenance_work_mem = 64GB

MacMini postgresql.conf

Normal operations

autovacuum = on
maintenance_work_mem = 128MB
work_mem = 32MB
wal_buffers = 32MB
effective_cache_size = 800MB
shared_buffers = 512MB
checkpoint_segments = 32
checkpoint_completion_target = 1.0
effective_io_concurrency = 1
max_connections = 20

Data loading (same as above with the following changes):

autovacuum = off
maintenance_work_mem = 6GB
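
To double-check which of these values are actually in effect after a reload, pg_settings can be queried from psql (a generic query, nothing specific to this setup):

SELECT name, setting, unit
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
                'effective_cache_size', 'checkpoint_segments');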

Best Answer

After I asked on the pgsql-performance mailing list, Jeff Janes figured out that the cause was the default collation used by Postgres (see this link for more information). The MacMini was using the much faster C collation, while the Dell T420 was using en_US.UTF-8.
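
Each database's collation can be seen with psql's \l, as in the listings below, or directly from the catalog:

SELECT datname, datcollate, datctype FROM pg_database;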

T420 (Postgres 9.4.1)

                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 pen       | pen      | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)

MacMini (Postgres 9.0.13)

                                    List of databases
       Name        |   Owner    | Encoding | Collation | Ctype |    Access privileges    
-------------------+------------+----------+-----------+-------+-------------------------
 caldav            | caldav     | UTF8     | C         | C     | 
 collab            | collab     | UTF8     | C         | C     | 
 device_management | _devicemgr | UTF8     | C         | C     | 
 pen               | pen        | UTF8     | C         | C     | 
 postgres          | _postgres  | UTF8     | C         | C     | 
 roundcubemail     | roundcube  | UTF8     | C         | C     | 
 template0         | _postgres  | UTF8     | C         | C     | =c/_postgres           +
                   |            |          |           |       | _postgres=CTc/_postgres
 template1         | _postgres  | UTF8     | C         | C     | =c/_postgres           +
                   |            |          |           |       | _postgres=CTc/_postgres
(8 rows)

After setting the collation to "C" on the T420, the A transaction went from 195 seconds to 33 seconds (versus 40 seconds on the MacMini); the B type transaction went from 141 seconds to 78 seconds (versus 101 seconds on the MacMini). This is the biggest single performance improvement we found, apart from the BIOS changes; many kernel adjustments didn't provide significant improvements.
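
As a side note, if re-initializing the whole cluster is not an option, PostgreSQL 9.1 and later can also override the collation per index or per expression, which limits the expensive locale-aware comparisons to where they are needed (the table and column names below are hypothetical):

CREATE INDEX items_name_c_idx ON items (name COLLATE "C");
SELECT * FROM items ORDER BY name COLLATE "C";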

So, running the following command will initialize a new database cluster with collation C and encoding UTF8:

/usr/local/pgsql/bin/initdb -D /path/to/your/data --no-locale --encoding=UTF8
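
Alternatively, if only a new database is needed rather than a whole new cluster, the collation can be set at database creation time by copying template0, e.g.:

CREATE DATABASE pen
  TEMPLATE template0
  ENCODING 'UTF8'
  LC_COLLATE 'C'
  LC_CTYPE 'C';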

I hope this post will help other people in the future. I read plenty of digressions about the kernel, virtual memory, RAID controllers, disk caches, WAL and other low-level topics, but I never found anyone talking about collations.