PostgreSQL Query Planner – How to Profile and Diagnose Memory and Time Issues

debuggingexecution-planexplainpostgresql

I'm fighting with a specific "generated" query (the SQL is generated from a proprietary system, and I won't be able to share it or the execution plan here), and running into simply massive memory usage and long query times. When running locally on my machine, i'm seeing 6.5gb of memory used during the query, and the query takes around 8 seconds to fully run. However the weird part is that almost all of the time is spent in query planning, and not much in the actual execution:

Planning time: 7453.148 ms  
Execution time: 159.609 ms  

Interestingly enough, PREPARE-ing the query and then EXECUTE-ing it later causes the memory usage to spike during planning, but not during execution.

I have never seen a query take that long to plan, and none of our other large queries get anywhere near that.

My question is how can I diagnose or debug what the query planner is doing here? What tools does PostgreSQL provide to allow me to profile and diagnose why this query in particular is taking so much time and memory in the query planning stage, and what we can do to reduce the memory usage from the query planner.

Things i've tried:

  • Disabling GEQO entirely – had no measurable effect on the memory used, but did marginally increase the planning time to ~9 seconds up from ~7.5 seconds.
  • Tweaking GEQO threshold – had no measurable effect whether set to 2 or 99999 or anything in between.
  • Tweaking from_collapse_limit and join_collapse_limit – these had a small impact on the query, taking the planning down from ~7 seconds to ~5 seconds but not changing the memory usage at all (and the memory usage is really my main concern here)

I want to dig into this more, but without knowing how to profile what the query planner is doing, I'm basically shooting in the dark…

Edit: this is in PostgreSQL 9.6. We should be able to upgrade to 10 fairly easily, but 11 and up will need more research into if it's feasible.

Best Answer

The first stop to investigating this be would the system tool "perf". You can get the pid of the running backend (by using "top", for example--make sure you get the pid of the backend, not the client program) then do sudo perf top -p <pid>. You might get more useful output by first installing the debug symbols. How you do that is going to depend on your OS and how you installed it.

But once you identify the bottleneck, I wouldn't hold out much hope that anyone will do anything about it, given the far-out nature of your query. If you are willing to dig into the source code to propose our own solution, then I would suggest you start by compiling your own server with version 13dev. You aren't going to get very far in your investigations using only packaged versions, and especially not older packaged versions. (Also, the problem might already be fixed in 12 or 13dev).