Then it looks like an optimizer's blind spot, and you should use the second query.
When there is a condition joining two tables a and b: a.id = b.id, plus an additional condition a.id > @some_constant, the optimizer seems to use that condition to decide where to start the index scan on the a (id) index, but it doesn't apply it to the second index, b (id). So, adding the (redundant) condition b.id > @some_constant allows it to produce a slightly more efficient plan, skipping part of the b (id) index as well.
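For illustration, a minimal sketch of the two variants, using the t1/t2 tables from the question (assuming btree indexes on both id columns):

-- Original query: the range condition narrows only the scan on t1 (id).
SELECT *  -- or whatever columns you need
FROM t1
JOIN t2 ON t2.id = t1.id
WHERE t1.id > -9223372036513411363;

-- With the logically redundant copy of the condition, the planner can
-- skip the leading part of the t2 (id) index as well.
SELECT *
FROM t1
JOIN t2 ON t2.id = t1.id
WHERE t1.id > -9223372036513411363
  AND t2.id > -9223372036513411363;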
This could be posted to the Postgres hackers group as a suggestion for improvement (if it hasn't been already).
After the edit, we know there is a FOREIGN KEY constraint from t2 that REFERENCES t1. So the "natural" (equivalent) way to write the query would be:
SELECT
    -- whatever
FROM t2
LEFT JOIN t1 ON t1.id = t2.id
WHERE t2.id > -9223372036513411363;
Can you try this and tell us the execution plan it produces? There are some transformations that apply only to LEFT (outer) joins and not to inner joins.
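For reference, a sketch of how to capture the plan (the ANALYZE option actually executes the query, so drop it if the statement is too slow):

EXPLAIN (ANALYZE, BUFFERS)
SELECT
    -- whatever
FROM t2
LEFT JOIN t1 ON t1.id = t2.id
WHERE t2.id > -9223372036513411363;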
Unfortunately, this doesn't produce a different plan either.
The OP has posted a question on the Postgres performance list; the whole thread can be seen here: PostgreSQL seems to create inefficient plans in simple conditional joins. The reply by David Rowley confirms that this feature, although it has been considered, hasn't yet been implemented in the optimizer:
Yes, unfortunately you've done about the only thing that you can do,
and that's just include both conditions in the query. Is there some
special reason why you can't just write the t2.id > ... condition in
the query too? or is the query generated dynamically by some software
that you have no control over?
I'd personally quite like to see improvements in this area, and even
wrote a patch [1] which fixes this problem too. The problem I had when
proposing the fix for this was that I was unable to report details
about how many people are hit by this planner limitation. The patch I
proposed caused a very small impact on planning time for many queries,
and was thought by many not to apply in enough cases for it to be
worth slowing down queries which cannot possibly benefit. Of course I
agree with this, I've no interest in slowing down planning on queries,
but at the same time understand the annoying poor optimisation in this
area.
Although please remember the patch I proposed was merely a first draft
proposal. Not for production use.
The auto_explain feature seems to fit this use case, if you can take a brief service outage to enable it. It logs a plan for any statement exceeding a time threshold whether or not it has completed yet. It can also log plans before execution starts, so you can even get the plan of a statement that blows out memory!
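As a minimal sketch, assuming superuser access, auto_explain can be enabled for the current session without any restart (for cluster-wide logging, add it to shared_preload_libraries in postgresql.conf, which requires the brief outage mentioned above):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '30s';  -- log the plan of any statement slower than this
SET auto_explain.log_analyze = on;          -- include actual row counts and timings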
Best Answer
I know this question is old, but I monitor all my updates in PostgreSQL, including updates inside transactions, so I decided to help!
1- First, create a sequence, named myprogress for example.
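For example:

CREATE SEQUENCE myprogress;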
2- Make your update hit the sequence; then you are free to run it inside a transaction. For example, if you have an update, just rewrite it to add the extra AND condition, as in the sketch below:
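Here is a sketch with a hypothetical table and condition (the added predicate is always true for a default sequence, which starts at 1, so the result of the update is unchanged):

-- The original update:
UPDATE big_table
SET processed = true
WHERE created_at < now() - interval '1 year';

-- Rewritten: nextval() advances myprogress for each row the condition is
-- evaluated on, and since sequence increments are never rolled back, the
-- count is visible from other sessions even inside a transaction.
UPDATE big_table
SET processed = true
WHERE created_at < now() - interval '1 year'
  AND nextval('myprogress') > 0;

(PostgreSQL does not guarantee the evaluation order of AND conditions, so treat the counter as an estimate of the rows processed so far.)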
**You are OK to run this update inside a transaction, as sequence changes take effect globally and are never rolled back.**
Note that this will not significantly affect performance.
3- Monitor your progress. Now you can just connect in another session (or transaction) and select from your sequence (sequence changes take effect globally, so you will see the value from other sessions):
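For example, from the monitoring session:

-- Current value of the counter, visible from any session
SELECT last_value FROM myprogress;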
You will see how many rows have been affected by your query so far; with that, you can estimate how many rows per second your update is processing, and in most cases, how long you will need to wait...
4- At the end, just restart the sequence:
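For example:

ALTER SEQUENCE myprogress RESTART WITH 1;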
So you can use it again. But be careful: do not trust this if two users are using the same sequence at the same time. If you are in doubt, it is better to have your own progress sequence, with permissions granted only to you.
You can use it for slow SELECTs, DELETEs, and some INSERTs too!