PostgreSQL Pagination – How to Do Pagination with UUID v4 and Created Time on Concurrent Inserted Data

Tags: paging, postgresql

Context:

Out of curiosity, I'm load testing my application, and as a result a lot of concurrent inserts happened.

After load testing the create endpoint, I'm now load testing the fetch endpoint, including its pagination. For the pagination, I'm combining two columns, id (PK with UUID v4) and created_time. I've also added an index for faster sorting.
I'm following these solutions from here.

Problem:

Since the data was inserted concurrently, some rows share the same created_time; in my case up to 100 rows have the same timestamp.

This is my table schema, as an example:

BEGIN;

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

DROP TABLE IF EXISTS "payment_with_uuid";

CREATE TABLE "payment_with_uuid" (
 id VARCHAR(255) PRIMARY KEY NOT NULL DEFAULT (uuid_generate_v4()),
 amount integer NULL,
 name varchar(255) default NULL,
 created_time TIMESTAMPTZ NOT NULL DEFAULT (now() AT TIME ZONE 'utc')
);

CREATE INDEX idx_payment_pagination ON payment_with_uuid (created_time, id);

COMMIT;

This is my query,

SELECT  * from payment_with_uuid ORDER BY  created_time DESC, id DESC LIMIT 10;

It will return 10 payment rows. Assume the data looks like this, and that the timestamp stays the same through the 100th row:

+--------------------------------------+--------+-----------+---------------------+
| id                                   | amount | name      | created_time        |
+--------------------------------------+--------+-----------+---------------------+
| ffffa567-e95a-4c8b-826c-e2be6acaeb6d | 32003  | Allistair | 2020-05-24 21:27:10 |
| ffff2dd6-3872-4acc-afec-7a568935f729 | 32003  | James     | 2020-05-24 21:27:10 |
| fffe3477-1710-45c4-b554-b539a9ee8fa7 | 32003  | Kane      | 2020-05-24 21:27:10 |
+--------------------------------------+--------+-----------+---------------------+

And to fetch the next page, my query looks like this:

SELECT * FROM payment_with_uuid 
WHERE 
created_time <= '2020-05-24 21:27:10' :: timestamp
AND 
id <'fffe3477-1710-45c4-b554-b539a9ee8fa7' 
ORDER BY created_time DESC, id DESC LIMIT 10;

Because of that, the pagination is messed up: some records that appear on the 1st page may also appear on the 2nd, the 3rd, or any other page, and sometimes records are missing.

Questions and Notes:

Is there any way to do this in a more elegant way?
I know using auto-increment will solve this, but choosing an auto-increment id is not an option for us, because we're trying to keep everything consistent across microservices, and many services already use UUID as the PK.
Using offset and limit will also solve this, but as far as I know it's not good practice, as this article explains: https://use-the-index-luke.com/no-offset
I'm using Postgres 11.4

Best Answer

SELECT * FROM payment_with_uuid 
WHERE 
created_time <= '2020-05-24 21:27:10' :: timestamp
AND 
id <'fffe3477-1710-45c4-b554-b539a9ee8fa7' 
ORDER BY created_time DESC, id DESC LIMIT 10;

This is wrong, but it shouldn't lead to the problem you describe, where the same row shows up on pages 1, 2, and so on. Rather, it would cause most rows to never show up at all, because the two filters are applied independently. You need to apply the id filter only within ties on created_time. Elegance is a matter of opinion, I guess, but it seems to me that the most elegant solution is the tuple comparator, similar to what you had attempted to include in your original question:

SELECT * FROM payment_with_uuid 
WHERE 
(created_time,id) < ('2020-05-24 21:27:10' :: timestamp, 'fffe3477-1710-45c4-b554-b539a9ee8fa7') 
ORDER BY created_time DESC, id DESC LIMIT 10;
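The difference between the two predicates can be reproduced without a database. The following is only a sketch in Python, relying on the fact that Python tuples compare lexicographically, like the row comparison above; the sample rows and the helper names (page_independent, page_tuple, paginate) are made up for illustration.

```python
from datetime import datetime

# Hypothetical sample data: six rows sharing one created_time, plus two older rows.
ts_new = datetime(2020, 5, 24, 21, 27, 10)
ts_old = datetime(2020, 5, 24, 21, 27, 9)
rows = [(ts_new, f"id-{i:02d}") for i in range(6)]
rows += [(ts_old, "id-98"), (ts_old, "id-99")]

# Equivalent of ORDER BY created_time DESC, id DESC
ordered = sorted(rows, reverse=True)

def page_independent(cursor, limit):
    """created_time <= :ts AND id < :id, the two independent filters from the question."""
    ts, last_id = cursor
    return [r for r in ordered if r[0] <= ts and r[1] < last_id][:limit]

def page_tuple(cursor, limit):
    """(created_time, id) < (:ts, :id), compared lexicographically."""
    return [r for r in ordered if r < cursor][:limit]

def paginate(page_fn, limit):
    """Fetch the first page, then keep following the last row as the cursor."""
    page = ordered[:limit]
    seen = list(page)
    while page:
        page = page_fn(page[-1], limit)
        seen.extend(page)
    return seen

all_tuple = paginate(page_tuple, 3)
all_independent = paginate(page_independent, 3)

print(len(all_tuple), len(all_independent))
```

With this sample data, the tuple form walks through all eight rows in order, while the independent filters never return the two older rows, because their ids are not smaller than the cursor id.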

Note that the timestamp needs to be exact, and yours doesn't look like it is. How is it getting rounded to the nearest second? In my hands it looks more like 2020-05-25 09:16:29.380925-04
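To see why the precision matters, here is a small sketch in Python with made-up rows that fall within the same second. It assumes the application passes the cursor around as an ISO 8601 string; the names next_page, exact_cursor, and rounded_cursor are illustrative only.

```python
from datetime import datetime, timezone

# Hypothetical rows: same second, different microseconds.
micros = [380925, 380101, 120000, 5]
rows = sorted(
    [
        (datetime(2020, 5, 25, 13, 16, 29, us, tzinfo=timezone.utc), f"id-{i}")
        for i, us in enumerate(micros)
    ],
    reverse=True,  # created_time DESC, id DESC
)

def next_page(cursor, limit):
    """Rows strictly after the cursor in descending (created_time, id) order."""
    return [r for r in rows if r < cursor][:limit]

last = rows[1]  # last row of a first page of size 2

# Cursor that round-trips through a string and keeps the microseconds.
exact_cursor = (datetime.fromisoformat(last[0].isoformat()), last[1])

# Cursor whose timestamp was truncated to whole seconds.
rounded_cursor = (last[0].replace(microsecond=0), last[1])

exact_page = next_page(exact_cursor, 2)
rounded_page = next_page(rounded_cursor, 2)

print(len(exact_page), len(rounded_page))
```

In this sketch the exact cursor returns the two remaining rows, while the truncated cursor returns nothing, since every remaining row has a timestamp later than the truncated value.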

If for some reason you don't want to use the tuple comparator, then you need to include the timestamp twice, once for less than and once for equal to:

WHERE 
created_time < '2020-05-24 21:27:10' :: timestamp
OR  
(
    created_time = '2020-05-24 21:27:10' :: timestamp 
    AND 
    id <'fffe3477-1710-45c4-b554-b539a9ee8fa7' 
)

In addition to not being very elegant, this will probably not use the index very effectively. You could use boolean reasoning to re-write it to avoid that top-level OR, so that it can use the index, but then it will get even harder to read and understand.
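One common way to remove the top-level OR is to write the condition as created_time <= :ts AND (created_time < :ts OR id < :id). Whether the planner then uses the index well is something to confirm with EXPLAIN on your own data. The sketch below only checks, by brute force in Python over small integer stand-ins, that this form selects the same rows as the other two; the function names are illustrative.

```python
from itertools import product

def tuple_form(ct, id_, x, y):
    # (created_time, id) < (:x, :y)
    return (ct, id_) < (x, y)

def or_form(ct, id_, x, y):
    # created_time < :x OR (created_time = :x AND id < :y)
    return ct < x or (ct == x and id_ < y)

def and_form(ct, id_, x, y):
    # created_time <= :x AND (created_time < :x OR id < :y)
    return ct <= x and (ct < x or id_ < y)

# Small integers stand in for timestamps and ids; only their ordering matters.
values = range(4)
equivalent = all(
    tuple_form(*combo) == or_form(*combo) == and_form(*combo)
    for combo in product(values, repeat=4)
)

print(equivalent)
```

The check covers non-NULL values only, which matches the schema in the question, where both columns are NOT NULL.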

Related Solutions

Postgresql – Primary key with randomly varying increments (so it cannot be guessed easily)

I suggest a function taking a regclass parameter that runs ALTER SEQUENCE with a new randomly generated increment before it returns the next value from a given sequence.
Can be used as drop-in replacement for nextval().

Per documentation on ALTER SEQUENCE:

increment

The clause INCREMENT BY increment is optional. A positive value will make an ascending sequence, a negative one a descending sequence. If unspecified, the old increment value will be maintained.

However:

You must own the sequence to use ALTER SEQUENCE.

So we need to take care of privileges. You could make the function SECURITY DEFINER and owned by a superuser. If you don't REVOKE privileges from public it works for anyone on any sequence. There are two basic strategies to restrict usage:

To allow for selected sequences only, change the owner of those sequences to some dedicated role, say randseq and make randseq own the function (still with SECURITY DEFINER).
To allow for selected roles only, REVOKE all privileges on the function from public and GRANT EXECUTE to said roles. You might use a group role to simplify privilege management.

Or combine both:

CREATE OR REPLACE FUNCTION nextval_rand(regclass)
  RETURNS int AS
$func$
BEGIN
EXECUTE format('ALTER SEQUENCE %s INCREMENT %s'
               , $1                          -- regclass automatically sanitized
               , floor(random() * 100)::int + 1); -- values between 1 and 100
RETURN nextval($1)::int;
END
$func$ LANGUAGE plpgsql SECURITY DEFINER;

-- to restrict usage:
ALTER FUNCTION nextval_rand(regclass) OWNER TO randseq;
REVOKE ALL ON FUNCTION nextval_rand(regclass) FROM public;
GRANT EXECUTE ON FUNCTION nextval_rand(regclass) TO randseq;
GRANT randseq TO ???;

Call:

SELECT nextval_rand('tbl_tbl_id_seq'::regclass);

All you have to do now is replace nextval() with nextval_rand() in the column default of any serial column. And possibly change the owner of the sequence.

ALTER SEQUENCE tbl_tbl_id_seq OWNER TO randseq;
ALTER TABLE tbl ALTER COLUMN tbl_id SET DEFAULT nextval_rand('tbl_tbl_id_seq'::regclass);

Notes

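ALTER SEQUENCE is designed not to block concurrent transactions. It takes effect immediately and cannot be rolled back. It should work reliably in a multi-user environment. This describes PostgreSQL versions before 10; from version 10 on, ALTER SEQUENCE changes are transactional and hold a lock on the sequence until commit, so check the behavior for your version. Read the Notes section of the manual page for the fine print of ALTER SEQUENCE behavior.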

There is a very slim chance for a race condition, where two concurrent operations each run ALTER SEQUENCE before calling nextval(). Since we are operating with random numbers anyway, this really doesn't matter.

Since we are running dynamic SQL I would normally SET search_path = public, pg_temp for the function. But since the parameter is regclass, only valid sequence names can be passed and are automatically schema-qualified and escaped unambiguously.
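To get a feel for the ids this approach produces, here is a rough simulation in Python. It is not PostgreSQL code: RandSeqSim is a hypothetical in-memory stand-in for a sequence whose increment is re-drawn between 1 and 100 before every call, and it ignores concurrency and sequence caching.

```python
import random

class RandSeqSim:
    """Hypothetical in-memory stand-in for a sequence used via nextval_rand()."""

    def __init__(self, start=0, rng=None):
        self.value = start
        self.rng = rng or random.Random()

    def nextval_rand(self):
        # Draw a fresh increment between 1 and 100, then advance the sequence.
        self.value += self.rng.randint(1, 100)
        return self.value

seq = RandSeqSim(rng=random.Random(42))  # fixed seed only to make the run repeatable
ids = [seq.nextval_rand() for _ in range(1000)]
gaps = [b - a for a, b in zip(ids, ids[1:])]

print(ids[:5], min(gaps), max(gaps))
```

The ids stay unique and strictly increasing, so they still work as a primary key, but the gap between consecutive values varies, which makes the next id harder to guess.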

Postgresql – Improve performance on concurrent UPDATEs for a timestamp column in Postgres

Would a different kind of column be faster? For example an integer

No. timestamp and timestamptz are just 64-bit integers internally anyway.

Is there some way to not lock the column?

It doesn't lock the column. It takes a weak table lock that doesn't really block anything except DDL, and takes a row-level lock on the row you're updating.

There is no way to prevent the row-level lock. It exists because without it the behaviour and ordering of concurrent updates would be undefined. We don't like undefined behaviour in RDBMSs.

It only blocks concurrent updates of the same row anyway.

Any other tips to improve this?

Not with the detail provided. There's likely a better way to do what you're trying to do, but it'll probably involve taking a few steps back and looking for a different strategy for solving the underlying problem.

In the specific case of cache invalidation I think you might want to look into LISTEN and NOTIFY. Again though, there just isn't enough info here to go on.
