Postgresql – Extremely long postgres query

postgresql

I've run a migration query within a transaction, it went like this:

alter table big_and_loaded_table
    add column col1 bool;

update big_and_loaded_table set col1 = false;

It seems quite innocent, but it took about two minutes to execute. Moreover, it "locked" big_and_loaded_table: any query in my application that involved it — both read and write — took oh so long to execute, the same minute or so. I don't use transactions in my application at all, as well as any kind of explicit locks.

So I have two questions:

Why did that migration query take so long to execute? Was it because of a transaction? Or was it because of the query itself?
Why did it lock application queries that involved table big_and_loaded_table?
How should I run such kinds of migrations in the future? Probably this point won't make any sense when I'll figure out the first two though.

Postgres version is 11.6.

Best Answer

Postgres 11 added a feature where a table rewrite is not necessary if you add a column with a default value.

If you had used

alter table big_and_loaded_table
    add column col1 boolean default false;

the statement would have executed nearly instantaneous with the same effect.

But an ALTER TABLE statement will acquire an exclusive lock to prevent access to the table while it's being changed, there is no way around it.

Related Solutions

Postgresql – Subquery magic in PostgreSQL

why didn`t the first query fail?

Because this is valid SQL.

First of all consider that it is valid to run a SELECT against the table that doesn't reference any columns from that table.

SELECT DISTINCT 'foo'
FROM   product_template
WHERE  type = 'import';

The above will return a single row result of foo if any rows exist that match the WHERE clause.

Secondly consider that it is valid to reference a column from the outer table in a sub query (this is required for correlated sub queries to work).

In the event that the column name is not qualified with a table reference then it will be resolved in the inner scope if possible and outer scopes if not. In your case product_template has no such column so it was resolved as belonging to product_product.

As a best practice be explicit about the tables the columns belong to. Had the subquery been written as below it would have failed to compile and alerted you of the error.

IN (SELECT DISTINCT pt.product_tmpl_id
    FROM   product_template pt
    WHERE  pt.type = 'import');

what exactly did it do?

This depends. If FROM product_template WHERE type = 'import'; returns zero rows then you were lucky and this is equivalent to

update product_product 
set (write_date, default_code) = (LOCALTIMESTAMP, 'update') 
where product_tmpl_id = null; /*Never true*/

Probably you weren't lucky though and it did return at least one row. In this case you ran the equivalent of the following

update product_product 
set (write_date, default_code) = (LOCALTIMESTAMP, 'update') 
where product_tmpl_id = product_tmpl_id;

which is equivalent to

update product_product 
set (write_date, default_code) = (LOCALTIMESTAMP, 'update') 
where product_tmpl_id IS NOT NULL;

why took it so long to run?

I imagine as firstly it updated all rows in the table.

I'm not sure what the execution plans for this would be like in Postgres too.

In the worst case it might have been selecting all rows matching the where from product_template, passing in the correlated parameter, then performing DISTINCT on the result for each row in the outer product_product table.

PostGIS – Why This Query Takes So Long to Run

Your query is needlessly complicated and can be untangled to this 100 % equivalent one:

SELECT gid, id, ST_MakeValid(geom) AS geom
FROM   schema.table_polygons
WHERE  geom IS NOT NULL;

Neither original nor this one remove any duplicates (except by possibly removing some with geom IS NULL, but I doubt that was your intention).

Possible misunderstanding

There is a common misunderstanding concerning output column names in the WHERE clause - where you can only reference input column names. So, geom in:

WHERE geom IN ...

references the original input column not the output column (the result of ST_MakeValid(geom)).

PostgreSQL Where count condition

If your actual intention was to only keep rows where the result of ST_MakeValid(geom) is equal to any currently existing geom entry:

SELECT gid, id, ST_MakeValid(t.geom) AS geom 
FROM   schema.table_polygons t
WHERE  EXISTS (
   SELECT 1
   FROM   schema.table_polygons
   WHERE  geom = ST_MakeValid(t.geom)
   );

I added a table alias and table qualification to clarify what's what.

The index you mention is only going to help with this variant.

If your actual intention was to merge duplicate rows after fixing geom values:

SELECT DISTINCT ON (ST_MakeValid(t.geom))
       gid, id, ST_MakeValid(geom) AS geom 
FROM   schema.table_polygons t
ORDER  BY ST_MakeValid(t.geom), gid, id;

You did not define how to break ties and which row to keep from a set of dupes. I keep the row with the smallest gid from each set. And the smallest id from remaining dupes. Details:

Select first row in each GROUP BY group?

Aside: Calling a schema "schema" is like naming your son "son" and hoping for the best that no more sons shall arrive. In other words: don't.

Best Answer

Related Solutions

Postgresql – Subquery magic in PostgreSQL

PostGIS – Why This Query Takes So Long to Run

Possible misunderstanding

Related Question