Postgresql – How to debug an Idle Query

postgresqlpostgresql-9.3

I have a batch query that I'm running daily on my database. However, it seems to get stuck in idle state, and I'm having a lot of difficulty debugging what's going on.

The query is an aggregation on a table that is simultaneously getting inserted, which I'm guessing somehow relates to the issue. (The aggregation is on the previous days data, so the insertions shouldn't affect results.)

Clues

I'm running this inside a python script using sqlalchemy. However, I've set transaction level to autocommit, so I don't think things are getting wrapped inside a transaction. On the other hand, I don't see the query hang when I run it manually in sql terminal.
By querying pg_stat_activity, the query initially comes into the database as state='active'. After maybe 15 seconds, the state changes to 'idle' and additionally, the xact_start is set to NULL. The waiting flag is never set to true.
Before I figured out the transaction level autocommit for sqlalchemy, it would instead hang in state 'idle in transaction' rather than 'idle'. And it possibly hangs slightly less frequently since making that change?

I feel like I'm not equipped to dig any deeper than I have on this. Any feedback, even explaining more about different states and relevant postgres internals without giving a definite answer, would be greatly appreciated.

Best Answer

First thing you have to separate here are the words query, transaction and connection.

Clue: your query gets executed - it is in the active state. After that the query ends but the connection stays on - the idle state. There is no transaction (it has been committed) so the xact_start is null. So you have to close the connection after the query is succesful.
Clue: Before the autocommit was on the query was left in the middle of the transaction so first you would have to commit and then close connection.

Related Solutions

Postgresql – SELECT 1 – idle in transaction

idle in transaction means pretty much what it suggests: there is an open transaction doing nothing at the moment.

What you see can be easily reproduced. Do the following in one session (disable autocommit in your client if necessary - usually it isn't, an explicit BEGIN; will do that automatically):

test=>
    BEGIN;

    SELECT 1;

In an other session, let's see how it looks in pg_stat_activity:

test=> SELECT pid, 
              datname, 
              usename, 
              xact_start, 
              query_start, 
              state, 
              state_change, 
              query,
              now() 
       FROM pg_stat_activity;

pid          │ 13639
datname      │ test
usename      │ dezso
xact_start   │ 2015-10-23 10:14:36.853529+02
query_start  │ 2015-10-23 10:15:15.838252+02
state        │ idle in transaction
state_change │ 2015-10-23 10:15:15.838365+02
query        │ SELECT 1
now          │ 2015-10-23 10:16:24.340721+02

Here we see four different timestamps:

the time when the transaction started (xact_start),
the time when the query started (query_start),
the time when the transaction state last changed (state_change, I suppose it's the end of the query) and
the current timestamp (when the above query was run).

Between the last two the transaction is idling, but the last query is still shown.

To answer your question: many clients have autocommit disabled by default. In this case, a transaction is started silently when the first command (SELECT 1;, for example) is issued - and then never closed. In this case you'll see what you experience currently and I reproduced above.

AFAIK there is a way to persuade JBoss AS to turn it off. This might help: https://developer.jboss.org/wiki/Non-transactionaldataaccessandtheauto-commitmode

postgresql – Debugging Slow Queries on Large Tables

Table definition

A low hanging fruit first: The UNIQUE constraint details_id_key is a total waste of resources. It adds nothing useful over the existing PK details_pkey.

Where did you get these noisy DDL statements? All the redundant default clauses cloud the view. After trimming the noise:

CREATE TABLE public.details (
   value numeric,
   created_at timestamptz NOT NULL,
   updated_at timestamptz NOT NULL,
   effective_date timestamptz,
   submission_date timestamptz,
   id uuid NOT NULL,
   form_id uuid,
   order_id uuid,
   invoice_id uuid,
   customer_id uuid,
   seq int8 NOT NULL DEFAULT nextval('details_seq_seq'::regclass),
   client_day date,
   CONSTRAINT details_pkey PRIMARY KEY (id),
   CONSTRAINT details_id_key UNIQUE (id)  -- total waste
);

Query

I would have a couple of things to say to your query:

Use json_agg(t) instead of array_to_json(array_agg(t))?
Why join to orders at all? Can you join invoices to details directly?
```
JOIN invoices j2 ON j2.id = t1.invoice_id
```

Then I would ask for the definition of count_pages() (in your question) and for other details. But given your statement:

I've already developed a much improved version of this query that takes this specific scenario down from 985ms down to 20.

... I might be just wasting time. Rather have a separate question based on your improved version of the query - if you still need it.

Query plan gone wrong

Most of the total time is spent in one of the nested loop steps, where Postgres bases the plan on seriously underestimated row counts in both query plans:

   ->  Nested Loop  (cost=5.19..23499.04 rows=33 width=8)
             (actual time=1.964..929.479 rows=7166 loops=1)

   ->  Nested Loop  (cost=5.19..23499.35 rows=33 width=8)
             (actual time=0.275..277.738 rows=8413 loops=1)

That's the join between details and orders. _{(I am not even convinced you need orders in the query at all.)} The low estimates also build up in every step that leads up to this nested loop.

If Postgres would know to expect that many rows, it would probably choose a merge join or hash join instead.

The number of actual rows has grown in your second query plan, while the estimate has not. This would indicate that you are not running ANALYZE enough or that you just had big INSERTs adding more data and that's not reflected in the table statistics, yet. These two predicates in your query are typical suspects for such a situation:

AND t1.effective_date >= '2016-01-28T14:56:31.000Z'
AND t1.updated_at     >= '2016-02-07T21:29:50.000Z'

I would guess you add newer rows all the time, your autovacuum isn't keeping up with the latest stats on the newest rows while your query focuses on just those rows. This can escalate over time or after particularly big INSERTs.

Statistics for your big table

autovacuum is based on a percentage of rows inserted / updated / deleted. Default settings can be a bit of a trap for your special situation: a big table growing at a constant rate. Especially when mostly the latest additions are relevant for your query.

5M rows, 25K new rows per day. After 10 days, autovacuum "reckons":
Only 5 % new rows, my default autovacuum_analyze_scale_factor is 10 %. No need to run ANALYZE

The manual:

autovacuum_analyze_scale_factor (floating point)

Specifies a fraction of the table size to add to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE. The default is 0.1 (10% of table size). This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.

Bold emphasis mine. Read about the other autovacuum setting, too!

I suggest more aggressive settings - at least for your all-important table details. Per-table settings are possible:

It is possible to run ANALYZE on specific tables and even just specific columns of a table, so the flexibility exists to update some statistics more frequently than others if your application requires it.

Use Storage Parameters for your table. Example:

ALTER TABLE public.details SET (autovacuum_analyze_scale_factor = 0.001);

When is Postgres autovacuum executed

And run ANALYZE manually after big INSERTs with new data, before following up immediately with a (possibly) expensive query involving those newly added rows. Quoting the manual once more:

We recommend that active production databases be vacuumed frequently (at least nightly), in order to remove dead rows. After adding or deleting a large number of rows, it might be a good idea to issue a VACUUM ANALYZE command for the affected table. This will update the system catalogs with the results of all recent changes, and allow the PostgreSQL query planner to make better choices in planning queries.

Index

Less importantly, I would also consider a multicolumn index:

CREATE INDEX details_order_id_timestamps ON public.details
   (order_id, effective_date DESC NULLS LAST, updated_at DESC NULLS LAST);

I also do not see any index-only scans, which can also point towards VACUUM not running enough (compare the manual here) or that there are no covering indexes, or both. It might pay to review which of your many indexes are actually used and which might be missing.