PostgreSQL – Should I be concerned by large SERIAL values

Tags: integer, postgresql, sequence

I have a Django application that uses PostgreSQL to analyze data from tweets. The data set increases by thousands of records with each request. I am using the database primarily as a cache, so I had planned to delete all records every 24 hours to permit new requests without needlessly increasing the size of the database.

Django uses the SERIAL type for storing the ids; a new row is assigned the next value after the last one created, rather than the first available number. I don't use the ids for anything outside of the ORM. My concern is that I will eventually run out of key space on my 32-bit VM. What does PostgreSQL do when the next value is too large for SERIAL? Does it raise an error? Does it roll over to one?

Best Answer

TL;DR: If you're going to have more than 2³¹ possible values, change the column type to BIGSERIAL.

Explanation:

The SERIAL type creates a signed 32-bit integer column and is designed for values less than 2³¹ (note: 2³¹, not 2³², because the integers are signed). Here's the relevant snippet from the PostgreSQL docs:

The type names serial and serial4 are equivalent: both create integer columns. The type names bigserial and serial8 work the same way, except that they create a bigint column. bigserial should be used if you anticipate the use of more than 2³¹ identifiers over the lifetime of the table. The type names smallserial and serial2 also work the same way, except that they create a smallint column.
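For context, serial is really shorthand for an integer column wired up to its own sequence. A rough sketch of the expansion (table and sequence names here are just illustrative):

    -- A column declared like this:
    CREATE TABLE tweets (id SERIAL PRIMARY KEY);

    -- expands to approximately:
    CREATE SEQUENCE tweets_id_seq;
    CREATE TABLE tweets (
        id integer PRIMARY KEY NOT NULL DEFAULT nextval('tweets_id_seq')
    );
    ALTER SEQUENCE tweets_id_seq OWNED BY tweets.id;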

The serial section of the docs doesn't spell out the overflow behavior, but the CREATE SEQUENCE page does: sequences are created NO CYCLE by default, so once the maximum value is reached, any further call to nextval raises an error rather than wrapping back to one. Either way, the docs are clear that you shouldn't use serial if you expect more than 2³¹ identifiers.
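You can watch that failure mode directly with a throwaway sequence whose maximum is artificially small (the sequence name is made up for the demo):

    -- A disposable sequence with a tiny MAXVALUE to observe overflow
    CREATE SEQUENCE overflow_demo MAXVALUE 3;
    SELECT nextval('overflow_demo');  -- 1
    SELECT nextval('overflow_demo');  -- 2
    SELECT nextval('overflow_demo');  -- 3
    SELECT nextval('overflow_demo');
    -- ERROR:  nextval: reached maximum value of sequence "overflow_demo" (3)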

SERIAL columns are backed by sequence numbers. The server keeps track of the previous value (the prior max), and each time a new value is requested, it increments that value. The increment happens regardless of whether the id is ever permanently saved: if the transaction that fetched the id is rolled back, the id is "lost" and your sequence will have holes in it (i.e. gaps of unused ids). There's nothing wrong with that; if you're simply using them as unique ids, gaps cause no problems.
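Here's a quick illustration of how such a gap appears (the table name is made up for the demo):

    CREATE TABLE gap_demo (id SERIAL PRIMARY KEY, val text);

    BEGIN;
    INSERT INTO gap_demo (val) VALUES ('a');  -- consumes id 1
    ROLLBACK;                                 -- row discarded, but the sequence does not rewind

    INSERT INTO gap_demo (val) VALUES ('b');  -- gets id 2
    SELECT id, val FROM gap_demo;             -- returns only (2, 'b'); id 1 is a permanent gap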

Sequences are implemented this way for efficiency. The server only needs to keep track of one value (the prior max) to generate unique ids, it can cache blocks of values per connection to speed up generation, and because gaps are allowed, it never needs to hold a transactional lock on the sequence object while generating values. This makes sequences very efficient and highly concurrent.
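That per-connection caching is tunable when you create a sequence yourself; the tradeoff is potentially larger gaps (the sequence name is illustrative):

    -- Preallocate 50 values per backend connection: fewer trips to the
    -- shared sequence, at the cost of up to 49 skipped ids per connection
    CREATE SEQUENCE fast_seq CACHE 50;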

Just switch your database table to use BIGSERIAL.
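For an existing table, a minimal migration sketch might look like this (assuming a table named tweets; the second statement applies on PostgreSQL 10 and later, where the owned sequence is typed too):

    -- Widen the id column from integer to bigint
    ALTER TABLE tweets ALTER COLUMN id TYPE bigint;

    -- PostgreSQL 10+: also widen the owned sequence's range
    ALTER SEQUENCE tweets_id_seq AS bigint;

On the Django side, the equivalent is declaring the model's primary key as a BigAutoField so the ORM's migrations generate the bigint column for you.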