Postgresql – Queries on large database kill connection to the server, works with LIMIT

I'm trying to run queries on a large-ish database without killing the connection to the server.

I'm using Postgres 12.1 on a mac with 16gb of memory, and about 40gb of free disk. The database is 78gb according to pg_database_size with the largest table being 20gb according do pg_total_relation_size.

The error I get (from the log), regardless of which non-working query I run, is:

server process (PID xxx) was terminated by signal 9: Killed: 9

In VS code the error is "lost connection to server".

Two examples that don't work are:

UPDATE table
SET column = NULL
WHERE column = 0;

select columnA
from table1
where columnA NOT IN (
select columnB
from table2
);

I can run some of the queries (the above one, for example) by adding a LIMIT of, say, 1,000,000.

I suspected that I was running out of disk due to temp files, but in the log (with log_temp_files = 0), I can't see any temp files being written.

I tried increasing and decreasing work_mem, maintenance_work_mem, shared_buffers, and temp_buffers. None worked, the performance was about the same.

I tried dropping all indexes, which brought down the "cost" on some of the queries, but they still killed the connection to the server.

What could be my problem and how can I troubleshoot this further?

Additionally, I read that temp files from timed-out queries are stored in pqsql_tmp. I checked the folder, and it does not have files of significant size. Could the temp files be stored somewhere else?

The posgtres log for running a failed query looks like:

2020-02-17 09:31:08.626 CET [94908] LOG:  server process (PID xxx) was terminated by signal 9: Killed: 9
2020-02-17 09:31:08.626 CET [94908] DETAIL:  Failed process was running: update table
        set columnname = NULL
        where columnname = 0;

2020-02-17 09:31:08.626 CET [94908] LOG:  terminating any other active server processes
2020-02-17 09:31:08.626 CET [94919] WARNING:  terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94919] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94919] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.626 CET [94914] WARNING:  terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94914] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94914] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.629 CET [94908] LOG:  all server processes terminated; reinitializing
2020-02-17 09:31:08.698 CET [94927] LOG:  database system was interrupted; last known up at 2020-02-17 09:30:57 CET
2020-02-17 09:31:08.901 CET [94927] LOG:  database system was not properly shut down; automatic recovery in progress
2020-02-17 09:31:08.906 CET [94927] LOG:  invalid record length at 17/894C438: wanted 24, got 0
2020-02-17 09:31:08.906 CET [94927] LOG:  redo is not required

Running EXPLAIN on the second example query returns:

Seq Scan on gas_prices_all p  (cost=459.93..5635583.33 rows=128975016 width=16)
  Filter: (NOT (hashed SubPlan 1))
  SubPlan 1
    ->  Seq Scan on gas_station g  (cost=0.00..423.14 rows=14714 width=16)
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true

FWIW I get the same error for queries with "cost" around zero.

Update: I found a crash report from earlier today:

Process:               postgres [41042]
Path:                  /Users/USER/*/postgres
Identifier:            postgres
Version:               0
Code Type:             X86-64 (Native)
Parent Process:        postgres [40107]
Responsible:           postgres [41042]
User ID:               502

Date/Time:             2020-02-18 11:16:26.210 +0100
OS Version:            Mac OS X 10.14.5 (18F132)
Report Version:        12
Anonymous UUID:        F41CCD21-C558-6CB0-316D-D1FF3E279576

Sleep/Wake UUID:       5F08EAEC-373A-4D19-A243-E812E68D2697

Time Awake Since Boot: 1600000 seconds
Time Since Wake:       5700 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes:       0x0000000000000032, 0x00000001044c4060
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace CODESIGNING, Code 0x2

kernel messages:

VM Regions Near 0x1044c4060:
    __LINKEDIT             0000000104466000-00000001044c4000 [  376K] r--/rw- SM=COW  /Users/USER/*/*.dylib
--> VM_ALLOCATE            00000001044c4000-00000001044c5000 [    4K] r-x/rwx SM=ZER  
    VM_ALLOCATE            00000001044c5000-00000001044c6000 [    4K] rw-/rwx SM=ZER  

Application Specific Information:
crashed on child side of fork pre-exec

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   ???                             0x00000001044c4060 0 + 4367073376
1   postgres                        0x0000000103798851 ExecAgg + 1185 (executor.h:308)
2   postgres                        0x0000000103785d0f standard_ExecutorRun + 287 (execMain.c:1652)
3   postgres                        0x0000000103789c4e ParallelQueryMain + 670 (execParallel.c:1405)
4   postgres                        0x00000001036520ce ParallelWorkerMain + 1054 (parallel.c:1434)
5   postgres                        0x000000010385bec5 StartBackgroundWorker + 533 (bgworker.c:834)
6   postgres                        0x000000010386acb9 maybe_start_bgworkers + 1161
7   postgres                        0x00000001038696c5 sigusr1_handler + 357 (postmaster.c:5167)
8   libsystem_platform.dylib        0x00007fff76195b5d _sigtramp + 29
9   ???                             0x0000000000003200 0 + 12800
10  postgres                        0x00000001037d54ae main + 1678
11  libdyld.dylib                   0x00007fff75faa3d5 start + 1

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x00000001044c4060  rbx: 0x00007f98b9045600  rcx: 0x00000001072c48d8  rdx: 0x00007ffeec6242f4
  rdi: 0x00007f98b9047520  rsi: 0x00007f98b9045fc0  rbp: 0x00007ffeec624320  rsp: 0x00007ffeec624288
   r8: 0x00000000014bafca   r9: 0xffffffff00000000  r10: 0x00000001072c48d0  r11: 0x0000000000000005
  r12: 0x0000000103c51220  r13: 0x00007f98b9047510  r14: 0x00007f98b9045fc0  r15: 0x00007f98b90459a0
  rip: 0x00000001044c4060  rfl: 0x0000000000010246  cr2: 0x00000001044c4060

Logical CPU:     0
Error Code:      0x00000015
Trap Number:     14


Binary Images:
      **lots of stuff**

External Modification Summary:
  Calls made by other processes targeting this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by all processes on this machine:
    task_for_pid: 134420121
    thread_create: 0
    thread_set_state: 0

VM Region Summary:
ReadOnly portion of Libraries: Total=398.8M resident=0K(0%) swapped_out_or_unallocated=398.8M(100%)
Writable regions: Total=4.2G written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=4.2G(100%)

                                VIRTUAL   REGION 
REGION TYPE                        SIZE    COUNT (non-coalesced) 
===========                     =======  ======= 
Kernel Alloc Once                    8K        1 
MALLOC                            81.0M       19 
MALLOC guard page                   16K        3 
MALLOC_LARGE (reserved)             72K        2         reserved VM address space (unallocated)
STACK GUARD                       56.0M        1 
Stack                             8192K        1 
VM_ALLOCATE                        4.1G        4 
__DATA                            18.0M      200 
__FONT_DATA                          4K        1 
__LINKEDIT                       232.2M       11 
__TEXT                           166.6M      199 
__UNICODE                          564K        1 
mapped file                         64K        1 
shared memory                      552K        6 
===========                     =======  ======= 
TOTAL                              4.7G      450 
TOTAL, minus reserved VM space     4.7G      450

I can find two maybe-interesting things in the log:

2020-02-18 19:01:52.044375+0100  localhost kernel[0]: CODE SIGNING: process 51528[postgres]: rejecting invalid page at address 0x1100c1000 from offset 0x0 in file "<nil>" (cs_mtime:0.0 == mtime:0.0) (signed:0 validated:0 tainted:0 nx:0 wpmapped:1 dirty:0 depth:0)
2020-02-18 19:01:52.044805+0100  localhost ReportCrash[52560]: unknown nested kcdata type: 0x1004, size: 2108

2020-02-18 19:01:55.268060+0100  localhost ReportCrash[52560]: (CrashReporterSupport) Saved crash report for postgres[51528] version 0 to postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.273159+0100  localhost ReportCrash[52560]: (CrashReporterSupport) Removing excessive log: postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.274208+0100  localhost ReportCrash[52560]: shouldDisplayUnexpectedlyQuitNotification is NO

UPDATE: I ran the second exaple query (above) with LIMIT until it went from working (LIMIT of 2,200,000) to killing the connection to the server (LIMIT of 2,300,000). EXPLAIN ANALYZE on the LIMIT 2,200,000 query is:

Limit  (cost=459.93..96581.42 rows=2200000 width=16) (actual time=13.228..38573.440 rows=2200000 loops=1)
  ->  Seq Scan table1  (cost=459.93..5635583.33 rows=128975016 width=16) (actual time=13.227..38374.070 rows=2200000 loops=1)
        Filter: (NOT (hashed SubPlan 1))
        Rows Removed by Filter: 139729529
        SubPlan 1
          ->  Seq Scan on table2  (cost=0.00..423.14 rows=14714 width=16) (actual time=0.350..6.925 rows=14714 loops=1)
Planning Time: 0.138 ms
Execution Time: 38685.762 ms

EXPLAIN on the LIMIT 2,300,000 is: (EXPLAIN ANALYZE crashed here)

Limit  (cost=459.93..100950.58 rows=2300000 width=16)
  ->  Seq Scan on table1  (cost=459.93..5635583.33 rows=128975016 width=16)
        Filter: (NOT (hashed SubPlan 1))
        SubPlan 1
          ->  Seq Scan on table2  (cost=0.00..423.14 rows=14714 width=16)
JIT:
  Functions: 14
  Options: Inlining false, Optimization false, Expressions true, Deforming true

I'm reading this as JIT kicking in here, which makes sense given jit_above_cost is set (default) to 100,000. So perhaps the problem is with the JIT somehow?

UPDATE 2: With jit=off the second example query works and is also ca. twice as fast. How can that be and what could cause JIT to be a problem on my system?

Best Answer

"signal 9" indicates this is almost certainly the Out Of Memory killer. You could look in /var/log/kern.log or whatever it is called on your system to verify. Running out of disk space on temp files would not result in this error.

You should do an EXPLAIN on the query to see what plan it is using. I would usually guess it was a deranged HashAgg, except your first query should not be doing any aggregation, Hash or otherwise.

You could also try turning off overcommit. That way your processes would get ordinary out of memory errors which they could then report in a orderly fashion, rather than just getting nuked from orbit with no chance to report the problem. There are instructions for Linux, maybe you can adapt them for MacOS. If you can catch the process once memory usage has started growing a lot, but before it is killed, you can get a report of where the memory is being allocated.

Best Answer

Related Solutions

Postgresql – Running a UPDATE on 3 million records

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Related Question