You forgot to mention that you installed the additional module pg_trgm
, which provides the similarity()
function.
First of all, whatever else you do, use the similarity operator %
instead of the expression (similarity(job_title, 'sales executive') > 0.6)
. Much cheaper. And index support is bound to operators in Postgres, not to functions.
To get the desired minimum similarity of 0.6
, run:
SELECT set_limit(0.6);
The setting stays for the rest of your session unless reset to something else. Check with:
SELECT show_limit();
This is a bit clumsy, but great for performance.
Simple case
If you just wanted the best matches in column job_title
for the string 'sales executive' then this would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops
(but not with a GIN index):
CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);
To also include an equality condition on worksite_city
you would need the additional module btree_gist
. Run (once per DB):
CREATE EXTENSION btree_gist;
Then:
CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);
Query:
SELECT set_limit(0.6); -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY (job_title <-> 'sales executive')
LIMIT 50;
<->
being the "distance" operator:
one minus the similarity()
value.
Postgres can also combine two separate indexes, a plain btree index on worksite_city
, and a separate GiST index on job_title
, but the multicolumn index should be fastest - if you combine the two columns like this in queries regularly.
Your case
However, your query sorts by salary
, not by distance / similarity, which changes the nature of the game completely. Now we can use both GIN and GiST index, and GIN will be faster (even more so in Postgres 9.4 which has largely improved GIN indexes - hint!)
Similar story for the additional equality check on worksite_city
: install the additional module btree_gin
. Run (once per DB):
CREATE EXTENSION btree_gin;
Then:
CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);
Query:
SELECT set_limit(0.6); -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY salary
LIMIT 50 -- OFFSET 0
Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"
), possibly in combination with other indexes. The best solution depends on the complete picture.
Asides
You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?
You have some dubious data types:
employement_start_date | character varying
employement_end_date | character varying
Seems like those should be date
. Etc.
Related answers:
You have an unresolved naming conflict.
You must be using an old version of Postgres without declaring it. Or you are operating with non-default configuration setting.
Here you declare a variable named measurement_id
:
DECLARE
measurement_id INTEGER;
It's a folly to use ambiguous variable names to begin with. If you do it anyway, you must know what you are doing. I make it a habbit to prepend variable names with an underscore unlike column names, like _measurement_id
.
The later SELECT
statement is ambiguous:
ORDER BY measurement_id
This would raise an error message in modern PostgreSQL with default configuration. Per the documentation:
By default, PL/pgSQL will report an error if a name in a SQL statement
could refer to either a variable or a table column.
And:
To change this behavior on a system-wide basis, set the configuration
parameter plpgsql.variable_conflict
to one of error, use_variable, or
use_column (where error is the factory default). This parameter
affects subsequent compilations of statements in PL/pgSQL functions,
but not statements already compiled in the current session. Because
changing this setting can cause unexpected changes in the behavior of
PL/pgSQL functions, it can only be changed by a superuser.
In Postgres older than 9.0 this would be resolved to mean the variable. Per the documentation
In such cases you can specify that PL/pgSQL should resolve ambiguous
references as the variable (which is compatible with PL/pgSQL's
behavior before PostgreSQL 9.0)
Bold emphasis mine.
This would result in arbitrary results, since the sort order is now undetermined.
Audited Function
CREATE OR REPLACE FUNCTION insaft_function()
RETURNS TRIGGER AS
$func$
DECLARE
_measurement_id integer;
_file_header_index_start integer := TG_ARGV[0]::int;
_file_header_index_end integer := TG_ARGV[1]::int;
BEGIN
SELECT a.measurement_id INTO _measurement_id
FROM measurements a
ORDER BY a.measurement_id DESC -- you had ambiguity here!
LIMIT 1;
IF TG_OP = 'INSERT' THEN -- noise if only used in AFTER INSERT trigger
INSERT INTO file_headers (measurement_id, file_header_index_start
, file_header_index_end)
VALUES (_measurement_id, _file_header_index_start, _file_header_index_end);
END IF;
RETURN NULL; -- result is ignored since this is an AFTER trigger
END
$func$ LANGUAGE plpgsql;
Note how I named it insaft_function()
, since this is only to be used in an AFTER INSERT
trigger.
Trigger:
CREATE TRIGGER insaft_measurement_ids
AFTER INSERT ON measurements
FOR EACH ROW EXECUTE PROCEDURE insaft_function(1, 666);
But for the provided setup, you can radically simplify the function:
CREATE OR REPLACE FUNCTION insaft_function()
RETURNS TRIGGER AS
$func$
BEGIN
INSERT INTO file_headers (measurement_id, file_header_index_start
, file_header_index_end)
VALUES (NEW.measurement_id, TG_ARGV[0]::int, TG_ARGV[1]::int);
RETURN NULL; -- result ignored since this is an AFTER trigger
END
$func$ LANGUAGE plpgsql;
Best Answer
Because this is valid SQL.
First of all consider that it is valid to run a
SELECT
against the table that doesn't reference any columns from that table.The above will return a single row result of
foo
if any rows exist that match theWHERE
clause.Secondly consider that it is valid to reference a column from the outer table in a sub query (this is required for correlated sub queries to work).
In the event that the column name is not qualified with a table reference then it will be resolved in the inner scope if possible and outer scopes if not. In your case
product_template
has no such column so it was resolved as belonging toproduct_product
.As a best practice be explicit about the tables the columns belong to. Had the subquery been written as below it would have failed to compile and alerted you of the error.
This depends. If
FROM product_template WHERE type = 'import';
returns zero rows then you were lucky and this is equivalent toProbably you weren't lucky though and it did return at least one row. In this case you ran the equivalent of the following
which is equivalent to
I imagine as firstly it updated all rows in the table.
I'm not sure what the execution plans for this would be like in Postgres too.
In the worst case it might have been selecting all rows matching the
where
fromproduct_template
, passing in the correlated parameter, then performingDISTINCT
on the result for each row in the outerproduct_product
table.