For Postgres 9.1 or later:
CREATE INDEX idx_time_limits_ts_inverse
ON time_limits (id_phi, start_date_time, end_date_time DESC);
In most cases the sort order of an index is hardly relevant. Postgres can scan backwards practically as fast. But for range queries on multiple columns it can make a huge difference.
Consider your query:
SELECT *
FROM time_limits
WHERE id_phi = 0
AND start_date_time <= '2010-08-08 00:00'
AND end_date_time >= '2010-08-08 00:05';
Sort order of the first column id_phi in the index is irrelevant. Since it's checked for equality (=), it should come first. You got that right.
Postgres can jump to id_phi = 0 in next to no time and consider the following two columns of the matching index. These are queried with range conditions of inverted sort order (<=, >=). In my index, qualifying rows come first. Should be the fastest possible way with a B-tree index¹:
- You want start_date_time <= something: the index has the earliest timestamp first.
  - If it qualifies, also check column 3.
    Recurse until the first row fails to qualify (super fast).
- You want end_date_time >= something: the index has the latest timestamp first.
  - If it qualifies, keep fetching rows until the first one doesn't (super fast).
    Continue with the next value for column 2 ..
Postgres can either scan forward or backward. The way you had the index, it has to read all rows matching on the first two columns and then filter on the third. Be sure to read the chapter Indexes and ORDER BY in the manual. It fits your question pretty well.
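To see which index is used and how many rows are actually read, look at the query plan (a sketch of the diagnostic call; actual output depends on your data):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   time_limits
WHERE  id_phi = 0
AND    start_date_time <= '2010-08-08 00:00'
AND    end_date_time   >= '2010-08-08 00:05';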
How many rows match on the first two columns? Only a few with a start_date_time close to the start of the time range of the table. But almost all rows with id_phi = 0 at the chronological end of the table! So performance deteriorates with later start times.
Planner estimates
The planner estimates rows=62682 for your example query. Of those, none qualify (rows=0). You might get better estimates if you increase the statistics target for the table. For 2,000,000 rows ...
ALTER TABLE time_limits ALTER start_date_time SET STATISTICS 1000;
ALTER TABLE time_limits ALTER end_date_time SET STATISTICS 1000;
... might pay. Or even higher.
I guess you don't need that for id_phi (only a few distinct values, evenly distributed), but for the timestamps (lots of distinct values, unevenly distributed). I also don't think it matters much with the improved index.
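Note that raised statistics targets only take effect for query planning after the next ANALYZE (manual or via autovacuum) on the table:

ANALYZE time_limits;  -- collect statistics with the new targets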
CLUSTER / pg_repack / pg_squeeze
If you want it even faster, you could streamline the physical order of rows in your table. If you can afford to lock your table exclusively (at off hours, for instance), rewrite your table and order rows according to the index with CLUSTER:
CLUSTER time_limits USING idx_time_limits_ts_inverse;
Or consider pg_repack or the later pg_squeeze, which can do the same without an exclusive lock on the table.
Either way, the effect is that fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect that deteriorates over time, as writes on the table fragment the physical sort order.
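To gauge how far the physical order has drifted since the last CLUSTER, you can check the correlation statistic in the standard pg_stats view (a quick sketch):

SELECT attname, correlation
FROM   pg_stats
WHERE  tablename = 'time_limits'
AND    attname IN ('start_date_time', 'end_date_time');
-- correlation close to 1 (or -1) means the physical order still matches the column order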
GiST index in Postgres 9.2+
¹ With Postgres 9.2+ there is another, possibly faster option: a GiST index for a range column.
There are built-in range types for timestamp and timestamp with time zone: tsrange, tstzrange. A btree index is typically faster for an additional integer column like id_phi. Smaller and cheaper to maintain, too. But the query will probably still be faster overall with the combined index.
Change your table definition or use an expression index.
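If you change the table definition, a minimal sketch might look like this (the column name ts_span is my invention):

ALTER TABLE time_limits ADD COLUMN ts_span tsrange;

UPDATE time_limits
SET    ts_span = tsrange(start_date_time, end_date_time, '[]');
-- then index (id_phi, ts_span) with GiST - which still needs btree_gist for id_phi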
For the multicolumn GiST index at hand you also need the additional module btree_gist installed (once per database), which provides the operator classes to include an integer.
The trifecta! A multicolumn functional GiST index:
CREATE EXTENSION IF NOT EXISTS btree_gist; -- if not installed, yet
CREATE INDEX idx_time_limits_funky ON time_limits USING gist
(id_phi, tsrange(start_date_time, end_date_time, '[]'));
Use the "contains range" operator @>
in your query now:
SELECT *
FROM time_limits
WHERE id_phi = 0
AND tsrange(start_date_time, end_date_time, '[]')
    @> tsrange('2010-08-08 00:00', '2010-08-08 00:05', '[]');
SP-GiST index in Postgres 9.3+
An SP-GiST index might be even faster for this kind of query - except that, quoting the manual:
Currently, only the B-tree, GiST, GIN, and BRIN index types support multicolumn indexes.
Still true in Postgres 12.
You would have to combine an spgist index on just (tsrange(...)) with a second btree index on (id_phi). With the added overhead, I'm not sure this can compete.
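For completeness, the two-index combination would look something like this (index names are my invention):

CREATE INDEX idx_time_limits_tsrange_spgist ON time_limits
   USING spgist (tsrange(start_date_time, end_date_time, '[]'));

CREATE INDEX idx_time_limits_id_phi ON time_limits (id_phi);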
bytea will be optimal for storing the hash.
It'll be transferred in and out of the database as a hex string anyway, unless you use PostgreSQL's binary wire protocol (supported by libpq and partly by PgJDBC) to transfer it.
For best results, store the hash as bytea and have the client application use a PQexecParams call that requests binary results.
Though on re-reading, this is confusing:
For my implementation, the hash never needs to leave the database, but the hashed data must be compared with external data for existence frequently
Do you mean that it's not the hash that's transferred for comparison, but the original unhashed text data? If so, the above is irrelevant, as the binary protocol offers no benefits for text-form data.
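If the comparison can happen entirely inside the database, a sketch of that pattern (table and column names are made up; digest() comes from the additional module pgcrypto):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE hashes (
  hash bytea PRIMARY KEY  -- 32-byte sha256 digest
);

-- existence check: hash the external data on the fly, compare bytea to bytea
SELECT EXISTS (
   SELECT 1 FROM hashes
   WHERE  hash = digest('external data', 'sha256')
   );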
Also: "tens of billions" of rows is a lot. PostgreSQL has quite a large per-row overhead at 28 bytes, so you're going to be losing a lot of space. Especially once you factor in index overheads too. Is PostgreSQL the right tool for this job?
A final thought: With that many rows, you're getting up into hash-collision territory. Do you care if it's possible - though unlikely - for two different strings to have the same hash, so an incorrect unique violation is reported? If that's a problem then a unique b-tree index on the hash probably isn't the right tool for the job.
Best Answer
No, not necessarily. You can play "column tetris" to minimize padding and thereby save some space. The rule of thumb I gave and you quoted is one simple strategy for basic types that require alignment.
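To illustrate with a toy example (made-up tables; the padding arithmetic follows the standard alignment rules for these fixed-width types):

CREATE TABLE bad_order (
  a int       -- 4 bytes, then 4 bytes of padding to align the following bigint
, b bigint    -- 8 bytes
, c smallint  -- 2 bytes
);

CREATE TABLE good_order (
  b bigint    -- 8 bytes
, a int       -- 4 bytes
, c smallint  -- 2 bytes, no padding needed between columns
);

Ordering columns by descending alignment requirement (8, 4, 2, 1) packs fixed-width types without gaps - 4 bytes saved per row in this example.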
As I mentioned in the quoted answer, you can test the actual storage size (excluding the item identifier) with pg_column_size() on the whole row.

text and the related varchar and char types do not require padding, so there is nothing to gain. The same is true for your bytea columns.
Concerning storage size: the manual page on bytea tells us it takes "1 or 4 bytes plus the actual binary string". That means the actual space required for a bytea column of 16-byte, 32-byte, or 64-byte length is 17 or 20 bytes, 33 or 36 bytes, etc., respectively.

As demonstrated in this SQL Fiddle, a bytea variable always has an overhead of 4 bytes. When stored in a column, however, it starts out with just 1 byte of overhead and switches to 4 bytes for values of 127 bytes length or more.
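You can reproduce the 4-byte overhead of a bytea variable with pg_column_size() (a quick sketch):

SELECT pg_column_size('\x00'::bytea);                                -- 5:  1 byte + 4 bytes header
SELECT pg_column_size('\x000102030405060708090a0b0c0d0e0f'::bytea);  -- 20: 16 bytes + 4 bytes header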
24 bytes of overhead are added for the row type. Another 4 bytes are needed for the item identifier per tuple in the data page.
As for alignment requirements of bytea, see the chapter about database page layout in the manual. I would suggest you read that whole chapter - probably a couple of times; it's a tough read.