PostgreSQL Docker – How to Reduce Excessive CPU Consumption

Tags: cpu, docker, postgresql

I'm using a Postgres container to run some small non-critical apps and sites. It's been stable for a while, but now the container has started to consume some serious CPU after it's been running for a short period of time. I have removed all other containers which use the Postgres container, and even after starting a new instance, the excessive CPU utilisation reoccurs. In my host (docker stats), I see this:

CONTAINER ID        NAME                                          CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
cd553249727d        data_postgresql.1.ft2gof5jci25xs5w5uqw6eezi   814.52%             22.11MiB / 46.95GiB   0.05%               129kB / 116kB       0B / 692kB          23

And this (top):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28923 70        20   0  633580  19664    488 S 696.7  0.0   2408:51 Dp2N

In the container (top), I see this:

Mem: 42042244K used, 7183656K free, 3622600K shrd, 1952K buff, 30585480K cached
CPU:  63% usr   9% sys   0% nic  26% idle   0% io   0% irq   0% sirq
Load average: 9.77 9.70 9.66 13/508 11090
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   94     1 postgres S     618m   1%   3  58% ./Dp2N    <----- WTF?!?!?
   53    52 postgres S     1588   0%   1   1% {systemd} /bin/sh ./systemd
   47     1 postgres S     163m   0%   8   0% postgres: postgrea67 postgres 10.2
   22     1 postgres S     161m   0%   0   0% postgres: autovacuum launcher proc
   20     1 postgres S     161m   0%   8   0% postgres: writer process
   21     1 postgres S     161m   0%   5   0% postgres: wal writer process
    1     0 postgres S     161m   0%   0   0% postgres
   19     1 postgres S     161m   0%   8   0% postgres: checkpointer process
   23     1 postgres S    19988   0%   1   0% postgres: stats collector process
11081    53 postgres R     1588   0%   4   0% [systemd]
   33     0 root     S     1576   0%   9   0% sh
   52    47 postgres S     1568   0%  10   0% sh -c setsid ./systemd
   39    33 root     R     1508   0%  11   0% top
11083 11081 postgres Z        0   0%   5   0% [grep]
11084 11081 postgres Z        0   0%   4   0% [awk]

Query activity (no idea what select fun308928987('setsid ./systemd') does):

postgres=# select backend_start, usename, application_name, client_addr, client_hostname, query from pg_stat_activity;
         backend_start         |  usename   | application_name | client_addr | client_hostname |                                                    query
-------------------------------+------------+------------------+-------------+-----------------+-------------------------------------------------------------------------------------------------------------
 2018-05-23 07:34:14.694057+00 | postgres   | psql             |             |                 | select backend_start, usename, application_name, client_addr, client_hostname, query from pg_stat_activity;
 2018-05-23 01:26:55.235556+00 | postgrea67 |                  | 10.255.0.2  |                 | select fun308928987('setsid ./systemd');
 2018-05-23 07:26:03.519231+00 | postgrea67 |                  | 10.255.0.2  |                 | select fun308928987('setsid ./systemd');
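
For reference, those backends can be terminated from psql with something like this (a sketch only; it matches on the query text shown above, though whatever is submitting these queries can simply reconnect):

-- the pg_backend_pid() guard stops this statement from terminating itself,
-- since its own query text also contains 'fun308928987'
select pg_terminate_backend(pid)
from pg_stat_activity
where query like '%fun308928987%'
  and pid <> pg_backend_pid();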

In the service logs there are also a large number of instances of this error:

data_postgresql.1.ft2gof5jci25@IS-57436    | ps: bad -o argument 'command', supported arguments: user,group,comm,args,pid,ppid,pgid,etime,nice,rgroup,ruser,time,tty,vsz,stat,rss

If I kill the Dp2N process within the container, CPU usage returns to normal, but then something immediately spins that process back up. I have googled to see if I can find any info on Dp2N, but to no avail. It's located in an externally mounted volume:

/ # ls -al /var/lib/postgresql/data/pgdata/Dp2N
-rwxrwxrwx    1 postgres postgres   1886536 May 22 23:25 /var/lib/postgresql/data/pgdata/Dp2N

but it is seemingly created at runtime, as it's not part of the base image as far as I can see.
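
One way to confirm that (a sketch, against the same image tag) is to search a throwaway container started from the unmodified image; no output means the binary isn't shipped in the image:

docker run --rm postgres:9.6.9-alpine find / -name Dp2N 2>/dev/null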

I'm using postgres:9.6.9-alpine. The problem started with postgres:9.6.8-alpine, but upgrading didn't fix it. Any help would be greatly appreciated as this is driving me nuts!

Additional details

Results of running file:

sudo file /var/data/pgdata/pgdata/Dp2N
/var/data/pgdata/pgdata/Dp2N: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.24, BuildID[sha1]=bcb5ccf2bc22d1fcb0676506d7c7f31a9b7148bc, stripped

It turns out that Alpine comes with a limited version of the ps command. Running this:

apk --no-cache add procps

installs the full procps version and prevents the ps-related error in the logs. I've updated the Postgres image to include this, and so far the problem hasn't resurfaced. My speculation is that the CPU was being thrashed by the failing command being retried over and over.
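
For reference, the image change amounts to something like this (a minimal sketch of a derived image; build and tag it however suits your setup):

FROM postgres:9.6.9-alpine
# BusyBox ps does not support "-o command"; procps provides the full implementation
RUN apk --no-cache add procps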

Diagnosis

As per the answer below, it turns out I've been hacked. I'm currently at a loss as to how they got in, though. The server is locked down to a specific user with SSH certificate (no password) access and root login disabled. ('last' only shows my own accesses, unless that has been tampered with too.) There is no public access to PostgreSQL, the database admin password is very strong, and the database is currently accessed from only one other container. It seems likely that they got in via the websites on the server, but in this case only got as far as the container operating system, not the host OS. FWIW I'm running a WordPress site, Grafana, Kibana, Traefik, Portainer and my own .NET-based API. I'm starting off with a WordPress shakedown, as I've experienced plug-in-related infections with it before.

For educational purposes:

https://www.imperva.com/blog/2018/03/deep-dive-database-attacks-scarlett-johanssons-picture-used-for-crypto-mining-on-postgre-database/

Best Answer

You have been hacked, and are now mining cryptocurrency for the hacker.

They got in by guessing the password for your PostgreSQL server's superuser account. Then they used the lo_export facility to drop the binary for a user-defined function which executes arbitrary shell commands. That is what fun308928987 is: the SQL function which was created to wrap this binary.
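
If you want to see the mechanism for yourself before wiping the server, something along these lines (a sketch against the 9.6 catalogs) lists user-defined C-language functions loaded from arbitrary paths, which is how the shell-executing wrapper shows up, plus any large objects left over from the lo_export step:

-- C-language functions not loaded from $libdir are suspicious
select p.proname, l.lanname, p.probin
from pg_proc p
join pg_language l on l.oid = p.prolang
where l.lanname = 'c'
  and p.probin not like '$libdir%';

-- large objects that may have carried the dropped binary
select oid, lomowner::regrole as owner
from pg_largeobject_metadata;

An empty result for the large-object query doesn't prove anything, since the object may have been removed after the export.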

The best clean-up is to destroy the server and rebuild it, this time setting an actually strong password for the superuser account. Better yet, also change pg_hba.conf so it does not allow superuser connections, or preferably any connections, from the outside world.
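
A sketch of what that pg_hba.conf restriction could look like (the application database, role, and address range are placeholders; the first matching line wins):

# TYPE  DATABASE  USER      ADDRESS          METHOD
# superuser: local socket connections only
local   all       postgres                   peer
host    all       postgres  0.0.0.0/0        reject
host    all       postgres  ::/0             reject
# application role, only from the expected network, password required
host    appdb     appuser   10.255.0.0/16    md5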