Postgresql – Does Postgres offer a feature like “NEWSEQUENTIALID” in MS SQL Server to make UUID as primary key more efficient

Microsoft SQL Server offers the NEWID command to generate a new GUID (the Microsoft version of UUID) value that can be used as a primary key value (in their uniqueidentifier data type). These are not sequential in nature, so updating an index can be inefficient.

Alternatively, MS SQL Server offers the NEWSEQUENTIALID command. To quote their documentation:

Creates a GUID that is greater than any GUID previously generated by this function on a specified computer since Windows was started. After restarting Windows, the GUID can start again from a lower range, but is still globally unique. When a GUID column is used as a row identifier, using NEWSEQUENTIALID can be faster than using the NEWID function. This is because the NEWID function causes random activity and uses fewer cached data pages. Using NEWSEQUENTIALID also helps to completely fill the data and index pages.

Is there a way to get the more efficiently-indexed UUID in Postgres?

Best Answer

`uuid-ossp` module

PostgreSQL uses the standardized UUID generation algorithms provided by ITU-T Rec. X.667, ISO/IEC 9834-8:2005, and RFC 4122. From the docs on uuid-ossp,

The uuid-ossp module provides functions to generate universally unique identifiers (UUIDs) using one of several standard algorithms. There are also functions to produce certain special UUID constants.

uuid_generate_v1() This function generates a version 1 UUID. This involves the MAC address of the computer and a time stamp. Note that UUIDs of this kind reveal the identity of the computer that created the identifier and the time at which it did so, which might make it unsuitable for certain security-sensitive applications.

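So long as the MAC address does not change, every value generated on that machine shares the same node suffix and differs only in its timestamp and clock-sequence fields. Be aware, though, that a version 1 UUID stores the low 32 bits of its timestamp first, so consecutive values are time-based but do not sort in strict generation order.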
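
As a minimal sketch of how this could be wired up, assuming the uuid-ossp extension is installed on the server and using a hypothetical table named event (not from the question), a version 1 UUID can serve as the column default:

```sql
-- Enable the contrib module once per database (may require elevated privileges).
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Hypothetical table, for illustration only.
CREATE TABLE event (
    id      uuid PRIMARY KEY DEFAULT uuid_generate_v1(),
    payload text
);

INSERT INTO event (payload) VALUES ('first'), ('second');

-- The trailing 12 hex digits of each id are the node (MAC address)
-- and stay constant for rows generated on the same host.
SELECT id, payload FROM event;
```

If exposing the MAC address is a concern, the same module also provides uuid_generate_v1mc(), which uses a random multicast MAC address instead of the real one.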

That all said, I agree with @a_horse_with_no_name,

From my understanding this is only necessary in SQL Server because tables are stored in a clustered index, which makes random insertions slower than with a heap table. Postgres has no such concept, so I don't think that would make a difference in Postgres.

In fact, given the lower chance of collisions and the better security, I would take the random variant. For that I would use uuid_generate_v4():

uuid_generate_v4() This function generates a version 4 UUID, which is derived entirely from random numbers.
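
A sketch of the random variant as a primary key default, again with a hypothetical table name (account). On PostgreSQL 13 and later the built-in gen_random_uuid() also returns version 4 UUIDs without the extension; which one is available depends on your server version, so check before relying on it:

```sql
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Hypothetical table, for illustration only.
CREATE TABLE account (
    id   uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    name text NOT NULL
);

-- RETURNING hands the generated key back to the caller.
INSERT INTO account (name) VALUES ('alice') RETURNING id;
```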

Related Solutions

Postgresql – How to achieve clustering of rows without the exclusive lock and logging overhead of the `cluster` command

You can do this without using the cluster command and having the table locked or generating WAL for the whole table. The cost is that you need to full-scan the table regularly.

The basic idea is:

1. turn off autovacuum for the table
2. check each block to determine the degree of clustering
3. delete and re-insert all the rows from blocks below a clustering threshold
4. manually vacuum to free those (complete) blocks
5. repeat steps 2-4 as regularly as necessary
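
The test schema below creates its table with autovacuum already disabled. For a table that already exists, the first step could be done along these lines (a sketch, assuming a table named foo as in the example that follows):

```sql
-- Step 1 on an existing table: stop autovacuum from processing it.
ALTER TABLE foo SET (autovacuum_enabled = false);

-- To return the table to the default autovacuum behaviour later:
ALTER TABLE foo RESET (autovacuum_enabled);
```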

test schema, with sample data initially 'part-clustered':

create schema stack;
set search_path=stack;
create type t_tid as (blkno bigint, rowno integer);
create table foo(host_id integer, bar text default repeat('a',400)) with (autovacuum_enabled=false);
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g order by mod(g,10);
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g;
create index nu_foo on foo(host_id);

initial clustering statistics:

select cn, count(*)
from ( select count(*) cn
       from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
       group by blkno ) z
group by cn
order by cn;
/*
 cn | count
----+-------
  1 | 27769  <---- half clustered
  2 |     8
  5 |     1
 10 | 27778  <---- half un-clustered
*/
select count(distinct (ctid::text::t_tid).blkno) from foo where host_id=1;
/*
 count
-------
 30558  <--------- lots of blocks to read for `host_id=1`
*/

initial analyze (2146.503 ms):

explain analyze select count(bar) from foo where host_id=1;
/*
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15097.30..15097.31 rows=1 width=32) (actual time=2146.157..2146.158 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=95.17..15084.80 rows=5000 width=32) (actual time=21.586..2092.379 rows=100000 loops=1)
         Recheck Cond: (host_id = 1)
         Rows Removed by Index Recheck: 286610
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..93.92 rows=5000 width=0) (actual time=19.232..19.232 rows=100000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 2146.503 ms
*/

delete and re-insert the un-clustered rows:

with w as ( select blkno
            from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
            group by blkno
            having count(*)>2 )
   , d as ( delete from foo
            where (ctid::text::t_tid).blkno in (select blkno from w)
            returning * )
insert into foo(host_id,bar) select host_id,bar from d order by host_id;
--
vacuum foo;

new clustering statistics:

select cn, count(*)
from ( select count(*) cn
       from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
       group by blkno ) z
group by cn
order by cn;
/*
 cn | count
----+-------
  1 | 55541  <---- fully clustered
  2 |    16
*/
select count(distinct (ctid::text::t_tid).blkno) from foo where host_id=1;
/*
 count
-------
  5558  <--------- far fewer blocks to read for `host_id=1`
*/

new analyze (48.804 ms):

explain analyze select count(bar) from foo where host_id=1;
/*
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16110.64..16110.65 rows=1 width=32) (actual time=48.760..48.761 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=131.18..16098.14 rows=5000 width=32) (actual time=8.402..32.439 rows=100000 loops=1)
         Recheck Cond: (host_id = 1)
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..129.93 rows=5000 width=0) (actual time=7.636..7.636 rows=100000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 48.804 ms
*/

clean up:

drop schema stack cascade;

The above is workable now, but it is a bit quirky (autovacuum has to be turned off for the table) and requires regular full scans of the table. I think something similar without the disadvantages could be built into Postgres. You'd need:

A space efficient index to cluster on (this is coming in 9.4 with GIN compression, or better still in 9.5 with the new BRIN index type)
A 'vacuum-like' process that would scan that index to detect which blocks need to be deleted/reinserted (this would ideally be able to reinsert the rows into fresh blocks so auto-vacuum can be left at default)

Sql-server – Primary Key choice on table with unique identifier

There are a few things to consider here:

How is data most often looked up in this table?
How is data most often sorted in this table?
Does this table relate to others as the parent record (i.e. will other tables FK to the PK of this table)?

Also, please keep in mind:

The key fields of a Clustered Index are copied into the non-clustered indexes on the same table
For Clustered Indexes that are not a Primary Key (implied unique) or at least declared as UNIQUE, a hidden "uniqueifier" field is added to rows that would otherwise be duplicates.
NEWSEQUENTIALID() is sequential per each restart of the SQL Server service. It is possible that the starting value after a restart is less than the previous lowest value.

Hence:

If the PK field of this table shows up as a FK field in other tables, there is a definite performance implication of choosing to use a UNIQUEIDENTIFIER instead of an INT as the FKed tables would then have a larger FK field in them.
How many rows do you really expect to have in this table? INT is 4 bytes (compared to the 8 bytes of a BIGINT) and has a max value of 2,147,483,647. If you might have slightly over 2.14 billion items, you can also start the IDENTITY range at the min value of each datatype, which for INT is -2,147,483,648. Starting at the low end gives you the full 4.294 billion values to use. Compare that with the 8 bytes of the DATETIME field if you go with the Created field, plus the size of another field to make it unique, or the uniqueifier for any duplicate rows.
Since the key field(s) of the Clustered Index are included in Nonclustered indexes, that increases the chances of having a covering index without needing to INCLUDE other columns. Meaning, if you have the Clustered Index on the INT PK and a Nonclustered index on the UNIQUEIDENTIFIER, then JOINing to another table on that INT PK field while specifying the GUID value in a WHERE clause (assuming no other fields from this table are in the query) won't have to go back to the table, since the Nonclustered Index will have both of the required fields in it. Does the Created field give the same benefit? Likely not.

IF no other tables ever JOIN to this table:

THEN it might be ok to use the Created field as the non-unique Clustered Index and the Id field as the Nonclustered PK.
ELSE it is typically best to add an INT (unless you need more than 4.294 billion values) IDENTITY field, AssetId, as the Clustered PK, and the Id UNIQUEIDENTIFIER field as a Nonclustered Index. Since you likely already have code referencing the UNIQUEIDENTIFIER field as Id, I wouldn't change that name.
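
The ELSE branch could look roughly like the following T-SQL. This is only a sketch: the table name Asset and the constraint names are assumptions, while AssetId, Id, and Created follow the names used above. A unique constraint is used here as one way to get the nonclustered index on Id.

```sql
-- Hypothetical table name and constraint names, for illustration only.
CREATE TABLE dbo.Asset (
    -- Start at the INT minimum to make the full 4.294 billion values available.
    AssetId INT IDENTITY(-2147483648, 1) NOT NULL,
    -- NEWSEQUENTIALID() is only allowed in a DEFAULT on a uniqueidentifier column.
    Id      UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT DF_Asset_Id DEFAULT NEWSEQUENTIALID(),
    Created DATETIME NOT NULL
            CONSTRAINT DF_Asset_Created DEFAULT GETDATE(),
    CONSTRAINT PK_Asset PRIMARY KEY CLUSTERED (AssetId),
    CONSTRAINT UQ_Asset_Id UNIQUE NONCLUSTERED (Id)
);
```

Because AssetId is the clustered key, it is carried in the nonclustered index on Id, so a lookup by GUID that only needs AssetId can be answered from that index alone.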
