PostgreSQL – Using varchar_pattern_ops in a Multi-Column Index

index, optimization, postgresql

I'm using PostgreSQL 9.5.

If I have a table with 2 columns like so:

CREATE TABLE mystuff
(
  somestring character varying(256),
  timestamp_ timestamp without time zone NOT NULL
);

Will this multi-column btree index:

CREATE INDEX mystuff_idx
  ON mystuff
  USING btree
  (timestamp_ , somestring varchar_pattern_ops);

improve the performance of the following query?

SELECT count(*)
FROM mystuff
WHERE timestamp_ > '01/01/2012' AND somestring LIKE 'foo%';

What about this one?

SELECT count(*)
FROM mystuff
WHERE timestamp_ > '01/01/2012' AND somestring LIKE '%foobar';

For extra points, please explain how a multi-column btree index is used for lookups when there is a LIKE clause on the second (or third, etc.) column of the index.

Best Answer

The two-column btree index will help with the like 'foo%' query, but probably not dramatically. It helps because the query can be executed as an index-only scan, so the LIKE condition can be evaluated within the index without ever visiting the table. The index scan jumps to the first entry > '01/01/2012' and then traverses from there to the logical end of the index, testing the LIKE condition at each entry. For each entry that passes, it checks the visibility map to see whether the table page holding that tuple is all-visible. If it is, it increments the counter and moves on. If not, it has to visit the table page to see whether the tuple is visible.
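One way to check whether the planner actually picks an index-only scan is to look at the plan. This is a minimal sketch, assuming the table and index from the question. The date is written in ISO form here to avoid depending on the DateStyle setting. The plan you get depends on your data and statistics, so no output is shown.

```sql
-- Refresh the visibility map and planner statistics, so that an
-- index-only scan can skip table visits for all-visible pages.
VACUUM ANALYZE mystuff;

-- Show the chosen plan together with run-time statistics.
-- Look for "Index Only Scan using mystuff_idx" and the "Heap Fetches" line.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM mystuff
WHERE timestamp_ > '2012-01-01'
  AND somestring LIKE 'foo%';
```

For an index-only scan, the "Heap Fetches" line reports how many index entries still needed a visit to the table. A high number right after heavy writes usually drops again once VACUUM has run.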

How much of a help this will be is hard to predict, as it depends on the size of the index, the table, your RAM, what mix of queries you are running (which changes which kind of data is likely to be found in the cache), among other things.

It will not help with the like '%foobar' query. From the outside looking in, there is no reason it couldn't help in the same way, but PostgreSQL's index machinery just hasn't been made clever enough (yet) to recognize and implement that possibility.
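A commonly used workaround for suffix matches, which is not part of the answer above, is to index the reversed string so that the suffix becomes a prefix. This is a sketch only, assuming the table from the question. The index name mystuff_rev_idx is made up for illustration, and whether the planner prefers this index depends on how selective each condition is.

```sql
-- Expression index on the reversed string.
-- reverse() returns text, hence text_pattern_ops rather than varchar_pattern_ops.
CREATE INDEX mystuff_rev_idx
  ON mystuff
  USING btree
  (reverse(somestring) text_pattern_ops);

-- The suffix pattern '%foobar' becomes the prefix pattern 'raboof%'
-- when applied to the reversed value.
SELECT count(*)
FROM mystuff
WHERE timestamp_ > '2012-01-01'
  AND reverse(somestring) LIKE 'raboof%';
```

The query has to use the same expression, reverse(somestring), as the index definition. Otherwise the planner cannot match the condition to the index.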

Related Solutions

MySQL query optimizer ignoring smaller scan on TIMESTAMP column, cardinality

For these queries, if your WHERE clause is as you have shown and you also have ORDER BY rf_timestamp, you can use this index, which should be far better than a single-column index on either si_id or rf_timestamp:

ALTER TABLE rf
  ADD INDEX si_id__rf_timestamp__IX         -- choose a name for the index
    (si_id, rf_timestamp) ;

With a table of this size, adding this index will take some time and the table will be locked in the meantime, so it would be better to do this when there is little traffic and little work being done by others in the database.

Postgresql table with one integer column, sorted index, with duplicate primary key

I think you're asking how to implement a solution you've already decided on for a more general problem you don't describe. If you were to outline the actual problem that this is supposed to solve, you might get better suggestions about how to solve it.

Working within the very limited information provided:

Update: I found your other question, which you really should've linked to. You seem to be trying to roll your own message queue. Don't do that. Read these:

Have I convinced you that you shouldn't try to do this yourself yet? Look into:

RabbitMQ
ZeroMQ
Job::Machine
ActiveMQ
PGQ
Celery

Some of what you want isn't available in current PostgreSQL versions. For example:

INSERTs should not do any query in that table or any kind of unique index. INSERTs shall just locate the best page for the main file/main btree for this table and just insert the row in between two other rows, ordered by ID.

That'd require an index-organized table, which PostgreSQL doesn't have yet. The closest you'll get would be a one-column table with a PRIMARY KEY. With regular VACUUM on PostgreSQL 9.2 you'd be able to use index-only scans to access it most of the time.

As for allowing duplicates, you don't really seem to want to permit them at all, you're just saying you want to work around concurrency issues by temporarily permitting them.

You can remove such duplicates during INSERT so the table itself doesn't need to permit them. However, that'll cause issues with:

INSERTs will happen in bulk (about 1000 per transaction) and must not fail, except for disc full, etc. There must not be any chance for deadlocks.

... assuming that those inserts occur concurrently from multiple transactions. You'll have races between the checks for existence and the insert that can cause insert batches to fail and have to be re-tried.
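Removing duplicates during INSERT could look roughly like the following. This is a sketch under assumptions: the table name ids, the column name id and the sample values are all hypothetical.

```sql
-- Hypothetical one-column table, used only for this illustration.
CREATE TABLE ids (id integer NOT NULL);

-- Insert only values that are not already present.
-- DISTINCT collapses duplicates inside the batch itself.
INSERT INTO ids (id)
SELECT DISTINCT v.id
FROM (VALUES (1), (2), (2), (3)) AS v(id)
WHERE NOT EXISTS (
  SELECT 1 FROM ids t WHERE t.id = v.id
);
```

The existence check and the insert are not atomic across concurrent transactions. Two sessions can both find a value missing and both insert it. With a unique constraint, one of the batches then fails. Without one, the duplicate stays in the table.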

I suspect that your best bet is to have a one-column table without a PRIMARY KEY. Just create an ordinary b-tree index on it, and leave the table without a PRIMARY KEY. Since it genuinely has no primary key (the only column may have duplicates) this is entirely reasonable.
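As a sketch of that suggestion, with the table and index names id_list and id_list_id_idx chosen here purely for illustration:

```sql
-- One integer column, duplicates allowed: no PRIMARY KEY, no UNIQUE constraint.
CREATE TABLE id_list (
  id integer NOT NULL
);

-- Ordinary (non-unique) b-tree index, used for lookups and ordered reads by id.
CREATE INDEX id_list_id_idx
  ON id_list
  USING btree (id);
```

Because the index is not unique, concurrent bulk inserts of the same value do not fail with a uniqueness violation. Any de-duplication then has to happen when the rows are read or cleaned up.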

(BTW, given that SQL is supposedly all about sets, it astounds me how awful it is at "add this entry to the set if not already present").
