The core feature is the window function lag().
Also pay special attention to avoiding deadlocks and race conditions with concurrent deletes and inserts (which can affect which rows to delete!):
CREATE OR REPLACE FUNCTION remove_vendor_price_dupes(_vendor int)
  RETURNS integer AS
$func$
DECLARE
   del_ct int;
BEGIN
   -- this may or may not be necessary:
   -- lock rows to avoid race conditions with concurrent deletes
   PERFORM 1
   FROM   vendor_prices
   WHERE  vendor = _vendor
   ORDER  BY sku, effective_date, id  -- guarantee row locks in consistent order
   FOR    UPDATE;

   -- delete redundant prices
   DELETE FROM vendor_prices v
   USING (
      SELECT id
           , price = lag(price) OVER w                 -- same as last row
             AND (lead(id) OVER w) IS NOT NULL AS del  -- not last row
      FROM   vendor_prices
      WHERE  vendor = _vendor
      WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
      ) d
   WHERE  v.id = d.id
   AND    d.del;

   GET DIAGNOSTICS del_ct = ROW_COUNT;  -- optional:
   RETURN del_ct;                       -- return number of deleted rows
END
$func$ LANGUAGE plpgsql;
Call:
SELECT remove_vendor_price_dupes(1);
Notes
The current version of the 9.3 major release is 9.3.6. The project recommends that all users run the latest available minor release for whatever major version is in use.
A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too. It might pay to add the otherwise irrelevant price as last item of the index to get index-only scans out of this. You'll have to test.
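For example (the index name is made up; whether appending price pays off is something you'd have to test):

CREATE INDEX vendor_prices_dedupe_idx
   ON vendor_prices (vendor, sku, effective_date, id);

-- variant with price appended, aiming for index-only scans:
-- CREATE INDEX vendor_prices_dedupe_idx
--    ON vendor_prices (vendor, sku, effective_date, id, price);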
Since you have concurrent deletes, it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)
I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you; this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.
sku is unique per product, so we can PARTITION BY it.
ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as an additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.
To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.
If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:
LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
Do that only if it's necessary. It blocks all concurrent write access.
I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.
You forgot to mention that you installed the additional module pg_trgm, which provides the similarity() function.
First of all, whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Much cheaper. And index support is bound to operators in Postgres, not to functions.
To get the desired minimum similarity of 0.6, run:
SELECT set_limit(0.6);
The setting stays for the rest of your session unless reset to something else. Check with:
SELECT show_limit();
This is a bit clumsy, but great for performance.
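Putting both pieces together, the predicate rewrite looks roughly like this (a sketch; table and column names are taken from the query further down):

SELECT set_limit(0.6);  -- once per session

SELECT count(*)
FROM   lca_test
WHERE  job_title % 'sales executive';  -- instead of:
-- WHERE similarity(job_title, 'sales executive') > 0.6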
Simple case
If you just wanted the best matches in column job_title for the string 'sales executive', then this would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):
CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);
To also include an equality condition on worksite_city you would need the additional module btree_gist. Run (once per DB):
CREATE EXTENSION btree_gist;
Then:
CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);
Query:
SELECT set_limit(0.6); -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY (job_title <-> 'sales executive')
LIMIT 50;
<-> being the "distance" operator: one minus the similarity() value.
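A quick way to see the relationship (a sketch; the constant strings are only examples and the exact values depend on their trigram sets):

SELECT similarity('sales executive', 'sales manager') AS sim
     , 'sales executive' <-> 'sales manager'          AS dist;  -- dist = 1 - sim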
Postgres can also combine two separate indexes, a plain btree index on worksite_city, and a separate GiST index on job_title, but the multicolumn index should be fastest - if you combine the two columns like this in queries regularly.
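For reference, the two-index alternative would look something like this (index names are made up):

CREATE INDEX lcas_worksite_city_idx ON lcas (worksite_city);
CREATE INDEX lcas_job_title_trgm_gist_idx ON lcas USING gist (job_title gist_trgm_ops);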
Your case
However, your query sorts by salary, not by distance / similarity, which changes the nature of the game completely. Now we can use both GIN and GiST indexes, and GIN will be faster (even more so in Postgres 9.4, which has largely improved GIN indexes - hint!).
Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per DB):
CREATE EXTENSION btree_gin;
Then:
CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);
Query:
SELECT set_limit(0.6); -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY salary
LIMIT 50 -- OFFSET 0
Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.
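To see which index the planner actually picks, compare plans with and without the new index (a sketch):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles'
ORDER  BY salary
LIMIT  50;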
Asides
You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?
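One way to get a first impression is to check the usage counters in pg_stat_user_indexes (a sketch; the counters only reflect activity since the last statistics reset):

SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'lcas'
ORDER  BY idx_scan;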
You have some dubious data types:
employement_start_date | character varying
employement_end_date | character varying
Seems like those should be date. Etc.
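If the stored strings are in an unambiguous format (e.g. ISO 8601), a conversion could look roughly like this (a sketch; validate and clean the data before running it):

ALTER TABLE lcas
   ALTER COLUMN employement_start_date TYPE date USING employement_start_date::date
 , ALTER COLUMN employement_end_date   TYPE date USING employement_end_date::date;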
Best Answer
What you see is completely normal and expected.
Unless you really need a gapless series of values, you don't have to worry about it. This is how most (well over 99%, I guess) database tables work.
If you need it, be careful if you have concurrent inserts on the table - you will need some locking to exclude the possibility of choosing the same value by two or more sessions running in parallel.
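A common pattern for the gapless case (a sketch, assuming a hypothetical single-row counter table; this is not code from the answer above) serializes number assignment with a row lock:

-- one-time setup: a single-row counter
CREATE TABLE invoice_counter (last_value bigint NOT NULL);
INSERT INTO invoice_counter VALUES (0);

-- per transaction: the UPDATE locks the row, so concurrent sessions wait their turn
BEGIN;
UPDATE invoice_counter
SET    last_value = last_value + 1
RETURNING last_value;          -- use this value for the new row
-- insert the row using the returned value, then:
COMMIT;                        -- a ROLLBACK also undoes the increment, so no gap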