I'm trying to run queries on a large-ish database without killing the connection to the server.
I'm using Postgres 12.1 on a mac with 16gb of memory, and about 40gb of free disk. The database is 78gb according to pg_database_size
with the largest table being 20gb according do pg_total_relation_size
.
The error I get (from the log), regardless of which non-working query I run, is:
server process (PID xxx) was terminated by signal 9: Killed: 9
In VS code the error is "lost connection to server"
.
Two examples that don't work are:
UPDATE table
SET column = NULL
WHERE column = 0;
select columnA
from table1
where columnA NOT IN (
select columnB
from table2
);
I can run some of the queries (the above one, for example) by adding a LIMIT
of, say, 1,000,000.
I suspected that I was running out of disk due to temp files, but in the log (with log_temp_files = 0
), I can't see any temp files being written.
I tried increasing and decreasing work_mem
, maintenance_work_mem
, shared_buffers
, and temp_buffers
. None worked, the performance was about the same.
I tried dropping all indexes, which brought down the "cost" on some of the queries, but they still killed the connection to the server.
What could be my problem and how can I troubleshoot this further?
Additionally, I read that temp files from timed-out queries are stored in pqsql_tmp. I checked the folder, and it does not have files of significant size. Could the temp files be stored somewhere else?
The posgtres log for running a failed query looks like:
2020-02-17 09:31:08.626 CET [94908] LOG: server process (PID xxx) was terminated by signal 9: Killed: 9
2020-02-17 09:31:08.626 CET [94908] DETAIL: Failed process was running: update table
set columnname = NULL
where columnname = 0;
2020-02-17 09:31:08.626 CET [94908] LOG: terminating any other active server processes
2020-02-17 09:31:08.626 CET [94919] WARNING: terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94919] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94919] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.626 CET [94914] WARNING: terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94914] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94914] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.629 CET [94908] LOG: all server processes terminated; reinitializing
2020-02-17 09:31:08.698 CET [94927] LOG: database system was interrupted; last known up at 2020-02-17 09:30:57 CET
2020-02-17 09:31:08.901 CET [94927] LOG: database system was not properly shut down; automatic recovery in progress
2020-02-17 09:31:08.906 CET [94927] LOG: invalid record length at 17/894C438: wanted 24, got 0
2020-02-17 09:31:08.906 CET [94927] LOG: redo is not required
Running EXPLAIN
on the second example query returns:
Seq Scan on gas_prices_all p (cost=459.93..5635583.33 rows=128975016 width=16)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on gas_station g (cost=0.00..423.14 rows=14714 width=16)
JIT:
Functions: 13
Options: Inlining true, Optimization true, Expressions true, Deforming true
FWIW I get the same error for queries with "cost" around zero.
Update: I found a crash report from earlier today:
Process: postgres [41042]
Path: /Users/USER/*/postgres
Identifier: postgres
Version: 0
Code Type: X86-64 (Native)
Parent Process: postgres [40107]
Responsible: postgres [41042]
User ID: 502
Date/Time: 2020-02-18 11:16:26.210 +0100
OS Version: Mac OS X 10.14.5 (18F132)
Report Version: 12
Anonymous UUID: F41CCD21-C558-6CB0-316D-D1FF3E279576
Sleep/Wake UUID: 5F08EAEC-373A-4D19-A243-E812E68D2697
Time Awake Since Boot: 1600000 seconds
Time Since Wake: 5700 seconds
System Integrity Protection: enabled
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes: 0x0000000000000032, 0x00000001044c4060
Exception Note: EXC_CORPSE_NOTIFY
Termination Reason: Namespace CODESIGNING, Code 0x2
kernel messages:
VM Regions Near 0x1044c4060:
__LINKEDIT 0000000104466000-00000001044c4000 [ 376K] r--/rw- SM=COW /Users/USER/*/*.dylib
--> VM_ALLOCATE 00000001044c4000-00000001044c5000 [ 4K] r-x/rwx SM=ZER
VM_ALLOCATE 00000001044c5000-00000001044c6000 [ 4K] rw-/rwx SM=ZER
Application Specific Information:
crashed on child side of fork pre-exec
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 ??? 0x00000001044c4060 0 + 4367073376
1 postgres 0x0000000103798851 ExecAgg + 1185 (executor.h:308)
2 postgres 0x0000000103785d0f standard_ExecutorRun + 287 (execMain.c:1652)
3 postgres 0x0000000103789c4e ParallelQueryMain + 670 (execParallel.c:1405)
4 postgres 0x00000001036520ce ParallelWorkerMain + 1054 (parallel.c:1434)
5 postgres 0x000000010385bec5 StartBackgroundWorker + 533 (bgworker.c:834)
6 postgres 0x000000010386acb9 maybe_start_bgworkers + 1161
7 postgres 0x00000001038696c5 sigusr1_handler + 357 (postmaster.c:5167)
8 libsystem_platform.dylib 0x00007fff76195b5d _sigtramp + 29
9 ??? 0x0000000000003200 0 + 12800
10 postgres 0x00000001037d54ae main + 1678
11 libdyld.dylib 0x00007fff75faa3d5 start + 1
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x00000001044c4060 rbx: 0x00007f98b9045600 rcx: 0x00000001072c48d8 rdx: 0x00007ffeec6242f4
rdi: 0x00007f98b9047520 rsi: 0x00007f98b9045fc0 rbp: 0x00007ffeec624320 rsp: 0x00007ffeec624288
r8: 0x00000000014bafca r9: 0xffffffff00000000 r10: 0x00000001072c48d0 r11: 0x0000000000000005
r12: 0x0000000103c51220 r13: 0x00007f98b9047510 r14: 0x00007f98b9045fc0 r15: 0x00007f98b90459a0
rip: 0x00000001044c4060 rfl: 0x0000000000010246 cr2: 0x00000001044c4060
Logical CPU: 0
Error Code: 0x00000015
Trap Number: 14
Binary Images:
**lots of stuff**
External Modification Summary:
Calls made by other processes targeting this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by all processes on this machine:
task_for_pid: 134420121
thread_create: 0
thread_set_state: 0
VM Region Summary:
ReadOnly portion of Libraries: Total=398.8M resident=0K(0%) swapped_out_or_unallocated=398.8M(100%)
Writable regions: Total=4.2G written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=4.2G(100%)
VIRTUAL REGION
REGION TYPE SIZE COUNT (non-coalesced)
=========== ======= =======
Kernel Alloc Once 8K 1
MALLOC 81.0M 19
MALLOC guard page 16K 3
MALLOC_LARGE (reserved) 72K 2 reserved VM address space (unallocated)
STACK GUARD 56.0M 1
Stack 8192K 1
VM_ALLOCATE 4.1G 4
__DATA 18.0M 200
__FONT_DATA 4K 1
__LINKEDIT 232.2M 11
__TEXT 166.6M 199
__UNICODE 564K 1
mapped file 64K 1
shared memory 552K 6
=========== ======= =======
TOTAL 4.7G 450
TOTAL, minus reserved VM space 4.7G 450
I can find two maybe-interesting things in the log:
2020-02-18 19:01:52.044375+0100 localhost kernel[0]: CODE SIGNING: process 51528[postgres]: rejecting invalid page at address 0x1100c1000 from offset 0x0 in file "<nil>" (cs_mtime:0.0 == mtime:0.0) (signed:0 validated:0 tainted:0 nx:0 wpmapped:1 dirty:0 depth:0)
2020-02-18 19:01:52.044805+0100 localhost ReportCrash[52560]: unknown nested kcdata type: 0x1004, size: 2108
2020-02-18 19:01:55.268060+0100 localhost ReportCrash[52560]: (CrashReporterSupport) Saved crash report for postgres[51528] version 0 to postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.273159+0100 localhost ReportCrash[52560]: (CrashReporterSupport) Removing excessive log: postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.274208+0100 localhost ReportCrash[52560]: shouldDisplayUnexpectedlyQuitNotification is NO
UPDATE: I ran the second exaple query (above) with LIMIT
until it went from working (LIMIT
of 2,200,000) to killing the connection to the server (LIMIT
of 2,300,000). EXPLAIN ANALYZE
on the LIMIT
2,200,000 query is:
Limit (cost=459.93..96581.42 rows=2200000 width=16) (actual time=13.228..38573.440 rows=2200000 loops=1)
-> Seq Scan table1 (cost=459.93..5635583.33 rows=128975016 width=16) (actual time=13.227..38374.070 rows=2200000 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 139729529
SubPlan 1
-> Seq Scan on table2 (cost=0.00..423.14 rows=14714 width=16) (actual time=0.350..6.925 rows=14714 loops=1)
Planning Time: 0.138 ms
Execution Time: 38685.762 ms
EXPLAIN
on the LIMIT
2,300,000 is: (EXPLAIN ANALYZE
crashed here)
Limit (cost=459.93..100950.58 rows=2300000 width=16)
-> Seq Scan on table1 (cost=459.93..5635583.33 rows=128975016 width=16)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on table2 (cost=0.00..423.14 rows=14714 width=16)
JIT:
Functions: 14
Options: Inlining false, Optimization false, Expressions true, Deforming true
I'm reading this as JIT kicking in here, which makes sense given jit_above_cost
is set (default) to 100,000. So perhaps the problem is with the JIT somehow?
UPDATE 2: With jit=off
the second example query works and is also ca. twice as fast. How can that be and what could cause JIT to be a problem on my system?
Best Answer
"signal 9" indicates this is almost certainly the Out Of Memory killer. You could look in
/var/log/kern.log
or whatever it is called on your system to verify. Running out of disk space on temp files would not result in this error.You should do an
EXPLAIN
on the query to see what plan it is using. I would usually guess it was a deranged HashAgg, except your first query should not be doing any aggregation, Hash or otherwise.You could also try turning off overcommit. That way your processes would get ordinary out of memory errors which they could then report in a orderly fashion, rather than just getting nuked from orbit with no chance to report the problem. There are instructions for Linux, maybe you can adapt them for MacOS. If you can catch the process once memory usage has started growing a lot, but before it is killed, you can get a report of where the memory is being allocated.