PostgreSQL – Should I be concerned by large SERIAL values

Tags: integer, postgresql, sequence

I have a Django application that uses PostgreSQL to analyze data from tweets. The data set increases by thousands of records with each request. I am using the database primarily as a cache, so I had planned to delete all records every 24 hours to permit new requests without needlessly increasing the size of the database.

Django uses the SERIAL type for storing the ids; a new row is assigned the next value after the last one created, rather than the first available number. I don't use the ids for anything outside of the ORM. My concern is that I will eventually run out of key space on my 32-bit VM. What does PostgreSQL do when the next value is too large for SERIAL? Does it raise an error? Does it roll over to one?

Best Answer

TL;DR: If you're going to have more than 2³¹ possible values, change the column type to BIGSERIAL.

Explanation:

The SERIAL type creates a signed 32-bit integer column and is designed for values less than 2³¹ (note: 2³¹, not 2³², because the integers are signed). Here's the relevant snippet from the PostgreSQL docs:

The type names serial and serial4 are equivalent: both create integer columns. The type names bigserial and serial8 work the same way, except that they create a bigint column. bigserial should be used if you anticipate the use of more than 2³¹ identifiers over the lifetime of the table. The type names smallserial and serial2 also work the same way, except that they create a smallint column.
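For context, serial is really shorthand for an integer column wired up to its own sequence. A rough sketch of the expansion (table and sequence names here are just illustrative):

    -- A column declared like this:
    CREATE TABLE tweets (id SERIAL PRIMARY KEY);

    -- expands to approximately:
    CREATE SEQUENCE tweets_id_seq;
    CREATE TABLE tweets (
        id integer PRIMARY KEY NOT NULL DEFAULT nextval('tweets_id_seq')
    );
    ALTER SEQUENCE tweets_id_seq OWNED BY tweets.id;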

The serial section of the docs doesn't spell out the overflow behavior, but the CREATE SEQUENCE page does: sequences are created NO CYCLE by default, so once the maximum value is reached, any further call to nextval raises an error rather than wrapping back to one. Either way, the docs are clear that you shouldn't use serial if you expect more than 2³¹ identifiers.
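You can watch that failure mode directly with a throwaway sequence whose maximum is artificially small (the sequence name is made up for the demo):

    -- A disposable sequence with a tiny MAXVALUE to observe overflow
    CREATE SEQUENCE overflow_demo MAXVALUE 3;
    SELECT nextval('overflow_demo');  -- 1
    SELECT nextval('overflow_demo');  -- 2
    SELECT nextval('overflow_demo');  -- 3
    SELECT nextval('overflow_demo');
    -- ERROR:  nextval: reached maximum value of sequence "overflow_demo" (3)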

SERIAL columns are backed by sequence numbers. The server keeps track of the previous value (the prior max), and each time a new value is requested, it increments that value. The increment happens regardless of whether the id is ever permanently saved: if the transaction that fetched the id is rolled back, the id is "lost" and your sequence will have holes in it (i.e. gaps of unused ids). There's nothing wrong with that; if you're simply using them as unique ids, gaps cause no problems.
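Here's a quick illustration of how such a gap appears (the table name is made up for the demo):

    CREATE TABLE gap_demo (id SERIAL PRIMARY KEY, val text);

    BEGIN;
    INSERT INTO gap_demo (val) VALUES ('a');  -- consumes id 1
    ROLLBACK;                                 -- row discarded, but the sequence does not rewind

    INSERT INTO gap_demo (val) VALUES ('b');  -- gets id 2
    SELECT id, val FROM gap_demo;             -- returns only (2, 'b'); id 1 is a permanent gap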

Sequences are implemented this way for efficiency. The server only needs to keep track of one value (the prior max) to generate unique ids, it can cache blocks of values per connection to speed up generation, and because gaps are allowed, it never needs to hold a transactional lock on the sequence object while generating values. This makes sequences very efficient and highly concurrent.
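That per-connection caching is tunable when you create a sequence yourself; the tradeoff is potentially larger gaps (the sequence name is illustrative):

    -- Preallocate 50 values per backend connection: fewer trips to the
    -- shared sequence, at the cost of up to 49 skipped ids per connection
    CREATE SEQUENCE fast_seq CACHE 50;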

Just switch your database table to use BIGSERIAL.
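For an existing table, a minimal migration sketch might look like this (assuming a table named tweets; the second statement applies on PostgreSQL 10 and later, where the owned sequence is typed too):

    -- Widen the id column from integer to bigint
    ALTER TABLE tweets ALTER COLUMN id TYPE bigint;

    -- PostgreSQL 10+: also widen the owned sequence's range
    ALTER SEQUENCE tweets_id_seq AS bigint;

On the Django side, the equivalent is declaring the model's primary key as a BigAutoField so the ORM's migrations generate the bigint column for you.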