The core feature is the window function lag().
Also pay special attention to avoiding deadlocks and race conditions with concurrent deletes and inserts (which can affect which rows to delete!):
CREATE OR REPLACE FUNCTION remove_vendor_price_dupes(_vendor int)
  RETURNS integer AS
$func$
DECLARE
   del_ct int;
BEGIN
   -- this may or may not be necessary:
   -- lock rows to avoid race conditions with concurrent deletes
   PERFORM 1
   FROM   vendor_prices
   WHERE  vendor = _vendor
   ORDER  BY sku, effective_date, id  -- guarantee row locks in consistent order
   FOR    UPDATE;

   -- delete redundant prices
   DELETE FROM vendor_prices v
   USING (
      SELECT id
           , price = lag(price) OVER w                 -- same as last row
             AND (lead(id) OVER w) IS NOT NULL AS del  -- not last row
      FROM   vendor_prices
      WHERE  vendor = _vendor
      WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
      ) d
   WHERE  v.id = d.id
   AND    d.del;

   GET DIAGNOSTICS del_ct = ROW_COUNT;  -- optional:
   RETURN del_ct;                       -- return number of deleted rows
END
$func$ LANGUAGE plpgsql;
Call:
SELECT remove_vendor_price_dupes(1);
Notes
The current version of the 9.3 major release is 9.3.6. The project recommends that all users run the latest available minor release for whatever major version is in use.
A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too. It might pay to add the otherwise irrelevant price as last item of the index to get index-only scans out of this. You'll have to test.
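For example (the index name is made up; whether appending price pays off is something you'd have to test):

CREATE INDEX vendor_prices_dedupe_idx
   ON vendor_prices (vendor, sku, effective_date, id);

-- variant with price appended, aiming for index-only scans:
-- CREATE INDEX vendor_prices_dedupe_idx
--    ON vendor_prices (vendor, sku, effective_date, id, price);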
Since you have concurrent deletes, it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)
I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you; this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.
sku is unique per product, so we can PARTITION BY it.
ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as an additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.
To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.
If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:
LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
Do that only if it's necessary. It blocks all concurrent write access.
I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.
You forgot to mention that you installed the additional module pg_trgm, which provides the similarity() function.
First of all, whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Much cheaper. And index support is bound to operators in Postgres, not to functions.
To get the desired minimum similarity of 0.6, run:
SELECT set_limit(0.6);
The setting stays for the rest of your session unless reset to something else. Check with:
SELECT show_limit();
This is a bit clumsy, but great for performance.
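Putting both pieces together, the predicate rewrite looks roughly like this (a sketch; table and column names are taken from the query further down):

SELECT set_limit(0.6);  -- once per session

SELECT count(*)
FROM   lca_test
WHERE  job_title % 'sales executive';  -- instead of:
-- WHERE similarity(job_title, 'sales executive') > 0.6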
Simple case
If you just wanted the best matches in column job_title for the string 'sales executive', then this would be a simple case of "nearest neighbor" search and could be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):
CREATE INDEX trgm_idx ON lcas USING gist (job_title gist_trgm_ops);
To also include an equality condition on worksite_city you would need the additional module btree_gist. Run (once per DB):
CREATE EXTENSION btree_gist;
Then:
CREATE INDEX lcas_trgm_gist_idx ON lcas USING gist (worksite_city, job_title gist_trgm_ops);
Query:
SELECT set_limit(0.6); -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY (job_title <-> 'sales executive')
LIMIT 50;
<-> being the "distance" operator: one minus the similarity() value.
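A quick way to see the relationship (a sketch; the constant strings are only examples and the exact values depend on their trigram sets):

SELECT similarity('sales executive', 'sales manager') AS sim
     , 'sales executive' <-> 'sales manager'          AS dist;  -- dist = 1 - sim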
Postgres can also combine two separate indexes, a plain btree index on worksite_city, and a separate GiST index on job_title, but the multicolumn index should be fastest - if you combine the two columns like this in queries regularly.
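For reference, the two-index alternative would look something like this (index names are made up):

CREATE INDEX lcas_worksite_city_idx ON lcas (worksite_city);
CREATE INDEX lcas_job_title_trgm_gist_idx ON lcas USING gist (job_title gist_trgm_ops);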
Your case
However, your query sorts by salary, not by distance / similarity, which changes the nature of the game completely. Now we can use both GIN and GiST indexes, and GIN will be faster (even more so in Postgres 9.4, which has largely improved GIN indexes - hint!).
Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per DB):
CREATE EXTENSION btree_gin;
Then:
CREATE INDEX lcas_trgm_gin_idx ON lcas USING gin (worksite_city, job_title gin_trgm_ops);
Query:
SELECT set_limit(0.6); -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY salary
LIMIT 50 -- OFFSET 0
Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.
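To see which index the planner actually picks, compare plans with and without the new index (a sketch):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles'
ORDER  BY salary
LIMIT  50;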
Asides
You have a lot of indexes. Are you sure they are all in use and pay their maintenance cost?
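One way to get a first impression is to check the usage counters in pg_stat_user_indexes (a sketch; the counters only reflect activity since the last statistics reset):

SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'lcas'
ORDER  BY idx_scan;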
You have some dubious data types:
employement_start_date | character varying
employement_end_date | character varying
Seems like those should be date. Etc.
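If the stored strings are in an unambiguous format (e.g. ISO 8601), a conversion could look roughly like this (a sketch; validate and clean the data before running it):

ALTER TABLE lcas
   ALTER COLUMN employement_start_date TYPE date USING employement_start_date::date
 , ALTER COLUMN employement_end_date   TYPE date USING employement_end_date::date;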
Best Answer
What you see is completely normal and expected.
Unless you really need a gapless series of values, you don't have to worry about it. This is how most (well over 99%, I guess) database tables work.
If you need it, be careful if you have concurrent inserts on the table - you will need some locking to exclude the possibility of choosing the same value by two or more sessions running in parallel.
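A common pattern for the gapless case (a sketch, assuming a hypothetical single-row counter table; this is not code from the answer above) serializes number assignment with a row lock:

-- one-time setup: a single-row counter
CREATE TABLE invoice_counter (last_value bigint NOT NULL);
INSERT INTO invoice_counter VALUES (0);

-- per transaction: the UPDATE locks the row, so concurrent sessions wait their turn
BEGIN;
UPDATE invoice_counter
SET    last_value = last_value + 1
RETURNING last_value;          -- use this value for the new row
-- insert the row using the returned value, then:
COMMIT;                        -- a ROLLBACK also undoes the increment, so no gap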