Postgresql – Subquery magic in PostgreSQL

postgresqlpostgresql-9.4subquery

I have a query:

update product_product 
set (write_date, default_code) = (LOCALTIMESTAMP, 'update') 
where product_tmpl_id in (
    select distinct product_tmpl_id 
    from product_template 
    where type='import');

And it takes 7 hours to complete. However, when I execute the subquery:

select distinct product_tmpl_id 
from product_template 
where type='import';

I get the error:

column "product_tmpl_id" does not exist

There isn't a column product_tmpl_id in the table product_template.

It was my mistake. The first query should have been:

update product_product set (write_date,default_code) = (LOCALTIMESTAMP,'update') 
where product_tmpl_id in
 (select distinct id from product_template where type='import');

And it onle takes several seconds to run.

So my questions are the following:

why didn`t the first query fail?
why took it so long to run?
what exactly did it do?

The result of select version(); is

PostgreSQL 9.4.3 on x86_64-unknown-linux-gnu, 
compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

Best Answer

why didn`t the first query fail?

Because this is valid SQL.

First of all consider that it is valid to run a SELECT against the table that doesn't reference any columns from that table.

SELECT DISTINCT 'foo'
FROM   product_template
WHERE  type = 'import';

The above will return a single row result of foo if any rows exist that match the WHERE clause.

Secondly consider that it is valid to reference a column from the outer table in a sub query (this is required for correlated sub queries to work).

In the event that the column name is not qualified with a table reference then it will be resolved in the inner scope if possible and outer scopes if not. In your case product_template has no such column so it was resolved as belonging to product_product.

As a best practice be explicit about the tables the columns belong to. Had the subquery been written as below it would have failed to compile and alerted you of the error.

IN (SELECT DISTINCT pt.product_tmpl_id
    FROM   product_template pt
    WHERE  pt.type = 'import');

what exactly did it do?

This depends. If FROM product_template WHERE type = 'import'; returns zero rows then you were lucky and this is equivalent to

update product_product 
set (write_date, default_code) = (LOCALTIMESTAMP, 'update') 
where product_tmpl_id = null; /*Never true*/

Probably you weren't lucky though and it did return at least one row. In this case you ran the equivalent of the following

update product_product 
set (write_date, default_code) = (LOCALTIMESTAMP, 'update') 
where product_tmpl_id = product_tmpl_id;

which is equivalent to

update product_product 
set (write_date, default_code) = (LOCALTIMESTAMP, 'update') 
where product_tmpl_id IS NOT NULL;

why took it so long to run?

I imagine as firstly it updated all rows in the table.

I'm not sure what the execution plans for this would be like in Postgres too.

In the worst case it might have been selecting all rows matching the where from product_template, passing in the correlated parameter, then performing DISTINCT on the result for each row in the outer product_product table.

Similarity operator `%`

First of all, whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Much cheaper. And index support is bound to operators in Postgres, not to functions.

To get the desired minimum similarity of 0.6, run:

SELECT set_limit(0.6);

The setting stays for the rest of your session unless reset to something else. Check with:

SELECT show_limit();

This is a bit clumsy, but great for performance.

Simple case

If you just wanted the best matches in column job_title for the string 'sales executive' then this would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):

CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);

To also include an equality condition on worksite_city you would need the additional module btree_gist. Run (once per DB):

CREATE EXTENSION btree_gist;

Then:

CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY (job_title <-> 'sales executive')
LIMIT  50;

<-> being the "distance" operator:

one minus the similarity() value.

Postgres can also combine two separate indexes, a plain btree index on worksite_city, and a separate GiST index on job_title, but the multicolumn index should be fastest - if you combine the two columns like this in queries regularly.

Your case

However, your query sorts by salary, not by distance / similarity, which changes the nature of the game completely. Now we can use both GIN and GiST index, and GIN will be faster (even more so in Postgres 9.4 which has largely improved GIN indexes - hint!)

Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per DB):

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);

Query:

SELECT set_limit(0.6);  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY salary 
LIMIT  50 -- OFFSET 0

Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.

Asides

You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?

You have some dubious data types:

employement_start_date | character varying
employement_end_date   | character varying

Seems like those should be date. Etc.

PostgreSQL Trigger Function – SELECT in Two Tables

You have an unresolved naming conflict.

You must be using an old version of Postgres without declaring it. Or you are operating with non-default configuration setting.

Here you declare a variable named measurement_id:

DECLARE
    measurement_id              INTEGER;

It's a folly to use ambiguous variable names to begin with. If you do it anyway, you must know what you are doing. I make it a habbit to prepend variable names with an underscore unlike column names, like _measurement_id.

The later SELECT statement is ambiguous:

ORDER BY measurement_id

This would raise an error message in modern PostgreSQL with default configuration. Per the documentation:

By default, PL/pgSQL will report an error if a name in a SQL statement could refer to either a variable or a table column.

And:

To change this behavior on a system-wide basis, set the configuration parameter plpgsql.variable_conflict to one of error, use_variable, or use_column (where error is the factory default). This parameter affects subsequent compilations of statements in PL/pgSQL functions, but not statements already compiled in the current session. Because changing this setting can cause unexpected changes in the behavior of PL/pgSQL functions, it can only be changed by a superuser.

In Postgres older than 9.0 this would be resolved to mean the variable. Per the documentation

In such cases you can specify that PL/pgSQL should resolve ambiguous references as the variable (which is compatible with PL/pgSQL's behavior before PostgreSQL 9.0)

Bold emphasis mine.
This would result in arbitrary results, since the sort order is now undetermined.

Audited Function

CREATE OR REPLACE FUNCTION insaft_function()
   RETURNS TRIGGER AS
$func$
DECLARE
   _measurement_id          integer;
   _file_header_index_start integer := TG_ARGV[0]::int;
   _file_header_index_end   integer := TG_ARGV[1]::int; 
BEGIN     

   SELECT a.measurement_id   INTO _measurement_id
   FROM   measurements a
   ORDER  BY a.measurement_id DESC  -- you had ambiguity here!
   LIMIT  1;

   IF TG_OP = 'INSERT' THEN  -- noise if only used in AFTER INSERT trigger
      INSERT INTO file_headers (measurement_id, file_header_index_start
                                              , file_header_index_end)
      VALUES (_measurement_id, _file_header_index_start, _file_header_index_end); 
   END IF;

   RETURN NULL; -- result is ignored since this is an AFTER trigger
END
$func$ LANGUAGE plpgsql;

Note how I named it insaft_function(), since this is only to be used in an AFTER INSERT trigger.

Trigger:

CREATE TRIGGER insaft_measurement_ids
AFTER INSERT ON measurements
FOR EACH ROW EXECUTE PROCEDURE insaft_function(1, 666);

But for the provided setup, you can radically simplify the function:

CREATE OR REPLACE FUNCTION insaft_function()
   RETURNS TRIGGER AS
$func$
BEGIN     
   INSERT INTO file_headers (measurement_id, file_header_index_start
                                           , file_header_index_end)
   VALUES (NEW.measurement_id, TG_ARGV[0]::int, TG_ARGV[1]::int);

   RETURN NULL;  -- result ignored since this is an AFTER trigger
END
$func$ LANGUAGE plpgsql;

Best Answer

Related Solutions

Best Index for Similarity Function in PostgreSQL

Similarity operator %