Postgresql table with one integer column, sorted index, with duplicate primary key

indexorder-byperformancepostgresqlprimary-key

I want to have a table in PostgreSQL with these properties:

One column with an integer
Allow duplicates in this column. Duplicates will be rare. Duplicates have no meaning logically and will be deleted by a WHERE clause on the integer. The table is used like a set.
Efficient ORDER BY+LIMIT on the column/rows, and DELETING with a WHERE-clause for that integer
INSERTs should not do any query in that table or any kind of unique index. INSERTs shall just locate the best page for the main file/main btree for this table and just insert the row in between two other rows, ordered by ID. (That's why duplicates must be allowed, the check for primary keys would cause more disc seeks and can fail transactions and cause deadlocks)
INSERTs will happen in bulk (about 1000 per transaction) and must not fail, expect for disc full, etc. There must not be any chance for deadlocks.
There shall not be additional btree files for this table, so no secondary indexes
The rows should occupy not much space, e.g. have no OIDs, so that many fit in one page.

I cannot think of a solution that solves all of this.

The easiest thing would be a PRIMARY INDEX on that one column. But this won't allow duplicates.

Also I could use a secondary index, but this would cause more operations per INSERT and more disc access. I am really bound by disc seeks here.

Currently my best solution would compromise on the last bullet point: I could just a PRIMARY KEY covering the integer column and also a dummy column, like an OID, a timestamp column or a SERIAL column. Then every row (=primary key) would be unique. But this would increase disc space and reduce the amount of rows per page.

Is there any chance of having a kind of primary key together with duplicates? Can I control what the main btree file on disc is organized by?

(I posted a related question like this question on StackOverflow, but I was very unclear about my goals. I hope this doesn't count as spam, this question here is completely different)

Best Answer

I think you're asking how to impliment a solution you'e already decided on for a more general problem you don't describe. If you were to outline the actual problem that this is supposed to solve you might get better suggestions about how to solve it.

Working within the very limited information provided:

Update: I found your other question, which you really should've linked to. You seem to be trying to roll your own message queue. Don't do that. Read these:

Have I convinced you that you shouldn't try to do this yourself yet? Look into:

RabbitMQ
ZeroMQ
Job::Machine
ActiveMQ
PGQ
Celery

Some of what you want isn't available in current PostgreSQL versions. For example:

INSERTs should not do any query in that table or any kind of unique index. INSERTs shall just locate the best page for the main file/main btree for this table and just insert the row in between two other rows, ordered by ID.`

That'd require an index-organized table, which PostgreSQL doesn't have yet. The closest you'll get would be a one-column table with a PRIMARY KEY. With regular VACUUM on PostgreSQL 9.2 you'd be able to use index-only scans to access it most of the time.

As for allowing duplicates, you don't really seem to want to permit them at all, you're just saying you want to work around concurrency issues by temporarily permitting them.

You can remove such duplicates during INSERT so the table its self doesn't need to permit them. However, that'll cause issues with:

INSERTs will happen in bulk (about 1000 per transaction) and must not fail, expect for disc full, etc. There must not be any chance for deadlocks.

... assuming that those inserts occur concurrently from multiple transactions. You'll have races between the checks for existence and the insert that can cause insert batches to fail and have to be re-tried.

I suspect that your best bet is to have a one-column table without a PRIMARY KEY. Just create an ordinary b-tree index on it, and leave the table without a PRIMARY KEY. Since it genuinely has no primary key (the only column may have duplicates) this is entirely reasonable.

(BTW, given that SQL is supposedly all about sets, it astounds me how awful it is at "add this entry to the set if not already present").

Related Solutions

Mysql – Primary key index with a DATETIME as first part of the compound key is never used

This is a bug in 5.5.x. See here

That suggests that your query should be

SELECT * 
FROM `stats`  
WHERE 
   dateDim = CAST('2014-04-03 00:00:00' as datetime)
   AND accountDim = 4
   AND execCodeDim = 9
   AND operationTypeDim = 1
   AND junkDim = 5
   AND ipCountryDim = 3

Postgresql – Postgres primary key sorted in the reverse order. Will it use the index efficiently

Indexes are typically unidirectional. This allows improvements in space and index scanning speed as back links don't need to be maintained. As @a_horse_with_no_name has noted, Postgres can and will be used to search in the reverse order to the index. (If this was not a unique index with no NULLs, positioning of NULLS FIRST/LAST could be a consideration.) Performance may not be as fast as using a non-standard index ordering, and the documentation notes significant speedups can be obtained with non-standard orderings.

For systems that don't support scanning indexes backwards, if your index is ascending, it would not be used for your query. There are two options in this case:

add another index with DESC specified; or
replace the primary key index with one which is has DESC specified. (I haven't tried this, and likely wouldn't as it breaks my expectations for primary keys.)

Best Answer

Related Solutions

Mysql – Primary key index with a DATETIME as first part of the compound key is never used

Postgresql – Postgres primary key sorted in the reverse order. Will it use the index efficiently

Related Question