PostgreSQL: How to optimize a query with ORDER BY and LIMIT 1

index, limits, optimization, order-by, postgresql

I have the following PostgreSQL schema:

-- "user" is a reserved word in PostgreSQL, so the name must be quoted
CREATE TABLE "User" (
    ID INTEGER PRIMARY KEY
);

CREATE TABLE BOX (
    ID INTEGER PRIMARY KEY 
);

CREATE SEQUENCE seq_item;

CREATE TABLE Item (
    ID INTEGER PRIMARY KEY DEFAULT nextval('seq_item'),
    SENDER INTEGER REFERENCES "User"(ID),
    RECEIVER INTEGER REFERENCES "User"(ID),
    INFO TEXT,
    BOX_ID INTEGER REFERENCES Box(id) NOT NULL,
    ARRIVAL TIMESTAMP
);

Its main use case is a typical producer/consumer scenario: different users may insert an item into the database, in a particular box and addressed to a particular user, and each user can retrieve the topmost (that is, the oldest) item in a box that is addressed to him or her. It more or less mimics the functionality of a queue at the database level.

More precisely, the most common operations are the following:

INSERT INTO ITEM(SENDER, RECEIVER, INFO, BOX_ID, ARRIVAL) 
VALUES (nsid, nrid, ncontent, nqid, ntime);

And retrieval commands based on a combination of either RECEIVER+BOX_ID or RECEIVER+SENDER:

SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND 
(i.BOX_ID=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;

and

SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND 
(i.SENDER=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;

The last two snippets are wrapped in a stored procedure.
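
For reference, a minimal sketch of what such a procedure might look like for the RECEIVER+BOX_ID case (the function name pop_item and its parameter names are my own; the question does not show the actual procedure):

CREATE OR REPLACE FUNCTION pop_item(p_receiver INTEGER, p_box_id INTEGER)
RETURNS Item AS $$
DECLARE
    it Item;
BEGIN
    -- Fetch the oldest item in the box addressed to this receiver (or to nobody).
    SELECT * INTO it FROM Item i
    WHERE (i.RECEIVER = p_receiver OR i.RECEIVER IS NULL)
      AND i.BOX_ID = p_box_id
    ORDER BY ARRIVAL LIMIT 1;

    -- Remove the item from the queue before handing it back.
    IF FOUND THEN
        DELETE FROM Item WHERE ID = it.ID;
    END IF;

    RETURN it;
END;
$$ LANGUAGE plpgsql;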

How can I achieve the best performance for this use case, knowing that users will insert and retrieve somewhere between 50,000 and 500,000 items (although the table is never expected to contain more than 100,000 items at any given point)?

EDIT

This is the EXPLAIN output I get for the SELECT statements with no indexes:

Limit  (cost=23.07..23.07 rows=1 width=35)
  ->  Sort  (cost=23.07..25.07 rows=799 width=35)
        Sort Key: arrival
        ->  Seq Scan on item i  (cost=0.00..19.07 rows=799 width=35)
              Filter: (((receiver = 1) OR (receiver IS NULL)) AND (sender = 1))

The best plan I get, as far as I understand it, is with an index on the timestamp (CREATE INDEX ti ON Item(ARRIVAL);):

Limit  (cost=0.42..2.88 rows=1 width=35)
  ->  Index Scan using ti on item i  (cost=0.42..5899.42 rows=2397 width=35)
        Filter: (((receiver = 2) OR (receiver IS NULL)) AND (sender = 2))

In all of the cases without an index on ARRIVAL, the table has to be sorted, which seems inefficient to me. If I try to combine the index on ARRIVAL with indexes on RECEIVER/SENDER, I get the same plan, but it runs slightly slower.

Is it correct to assume that a single index on ARRIVAL is the most efficient option?

Best Answer

A B-tree index on (sender, arrival) could help. It would allow the query to jump directly to the first-arrived message for a given sender.
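
For example (the index name here is my own, not from the question):

CREATE INDEX item_sender_arrival ON Item (SENDER, ARRIVAL);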

One on (arrival, sender) is much less likely to help. It lets you jump to the first-arrived message globally, but then you still have to walk along those messages until you hit one from the specified sender. If that particular sender is new, or only sends messages to people who keep up with their inbox, you might have to walk through most of the index before finding a qualifying message. It might help in that you only have to walk the index rather than the index and the table combined, but that would still be a much smaller win than the proper (sender, arrival) index.

Similarly, you would want another index on (box_id, arrival), not on (arrival, box_id).
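
Again, with an illustrative name:

CREATE INDEX item_box_arrival ON Item (BOX_ID, ARRIVAL);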

Also, performance testing on a table with 800 rows is useless if the real table will have 100,000 rows.
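
To get realistic volumes for testing, you can generate rows with generate_series and then refresh the planner statistics. A rough sketch (the row counts and value ranges here are arbitrary assumptions):

-- Populate the parent tables first so the foreign keys are satisfied.
INSERT INTO "User" (ID) SELECT g FROM generate_series(1, 100) g;
INSERT INTO Box (ID) SELECT g FROM generate_series(1, 10) g;

-- 100,000 items with spread-out arrival times.
INSERT INTO Item (SENDER, RECEIVER, INFO, BOX_ID, ARRIVAL)
SELECT (random() * 99)::int + 1,
       (random() * 99)::int + 1,
       'payload ' || g,
       (random() * 9)::int + 1,
       now() - (g || ' seconds')::interval
FROM generate_series(1, 100000) AS g;

ANALYZE Item;  -- make sure the planner sees the new row counts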