Postgresql – Does the optimized column order for a PostgreSQL table always have variable length types at the end

postgresql

There's a popular and seemingly authoritative blog post called On Rocks and Sand on how to optimize PostgreSQL tables for size to eliminate internal padding by re-ordering their column length. They explain how variable-length types incur some extra padding if they're not at the end of the table:

This means we can chain variable length columns all day long without introducing padding except at the right boundary. Consequently, we can deduce that variable length columns introduce no bloat so long as they’re at the end of a column listing.

And at the end of the post, to summarize:

Sort the columns by their type length as defined in pg_type.

There's a library that integrates with Ruby's ActiveRecord to automatically re-order columns to reduce padding called pg_column_byte_packer. You can see the README in that repo cites the above blog post and in general does the same thing that the blog post describes.

However, the pg_column_byte_packer does not return results consistent with the blog post it cites. The blog post pulls from from PostgreSQL's internal pg_type.typelen which puts variable-length columns always at the end via an alignment of -1. pg_column_byte_packer gives them an alignment of 3.

pg_column_byte_packer has an explanatory comment:

    # These types generally have an alignment of 4 (as designated by pg_type
    # having a typalign value of 'i', but they're special in that small values
    # have an optimized storage layout. Beyond the optimized storage layout, though,
    # these small values also are not required to respect the alignment the type
    # would otherwise have. Specifically, values with a size of at most 127 bytes
    # aren't aligned. That 127 byte cap, however, includes an overhead byte to store
    # the length, and so in reality the max is 126 bytes. Interestingly TOASTable
    # values are also treated that way, but we don't have a good way of knowing which
    # values those will be.
    #
    # See: `fill_val()` in src/backend/access/common/heaptuple.c (in the conditional
    # `else if (att->attlen == -1)` branch.
    #
    # When no limit modifier has been applied we don't have a good heuristic for
    # determining which columns are likely to be long or short, so we currently
    # just slot them all after the columns we believe will always be long.

The comment appears to be not wrong as text columns do have a pg_type.typalign of 4 but they've also got a pg_type.typlen of -1 which the blog post argues gets the most optimal packing when at the end of the table.

So in the case of a table that has an integer column, a text column, and a smallint column, pg_column_byte_packer will put the text columns right in between the two. They've even got a unit test to assert that this always happens.

My question here is: what order of columns actually packs for minimal space? The comment from pg_column_byte_packer appears to be not wrong as text columns do have a pg_type.typalign of 4, but they've also got a pg_type.typlen of -1.

Best Answer

The comment from pg_column_byte_packer appears to be not wrong as text columns do have a pg_type.typalign of 4, but they've also got a pg_type.typlen of -1.

Had me puzzled, too, when I researched it a couple of years back. typlen = -1 just indicates varlena storage, which nominally has typalign = 'i' (integer alignment, needs to start at 4-bytes offset). But that's not the whole story. Eventually, I found the explanation in a note in the source code:

Note also that we allow the nominal alignment to be violated when storing "packed" varlenas; the TOAST mechanism takes care of hiding that from most code.

So a varlena datum < 127 bytes (after possible compression) only adds 1 byte of overhead (signifying its length) and requires no alignment padding "on disk". (Hardly any "disks" any more these days.)

See:

To answer the question in the title:

Does the optimized column order for a PostgreSQL table always have variable length types at the end?

True-ish. Datums exceeding 127 bytes cannot be stored in optimized form and fall back to requiring "integer alignment". If we don't know that a varlena-type column will stay below that threshold (most of the time), we can't say for sure.

Also, there are additional considerations for optimized storage. With many columns in a row, it's cheaper to compute tuple storage offsets with NOT NULL, fixed-size length columns first. Placing frequently accessed columns first also yields a tiny advantage. All of this is further complicated by the TOAST mechanism and index(-only) access.

But all of these effects are typically tiny. And losing 3 bytes to alignment padding pales in comparison when a column occupies 200 bytes. So mostly not worth the bother. The rule of thumb covers most of it:

Sort columns by their required alignment typalign:
d --> i --> s --> c.
But typlen = -1 ("varlena") last (typically), even though formally typalign = 'i'.

The manual:

c = char alignment, i.e., no alignment needed.
s = short alignment (2 bytes on most machines).
i = int alignment (4 bytes on most machines).
d = double alignment (8 bytes on many machines, but by no means all).

Your example

So in the case of a table that has an integer column, a text column, and a smallint column, pg_column_byte_packer will put the text columns right in between the two.

pg_column_byte_packer is living up to its name. int --> text --> smallint is as tight as it gets.

For the typical case of short strings, the only relevant decision is to put int first. smallint can at most force 1 additional byte of alignment padding at odd byte offsets. Since tuple space is always allocated in multiples of 8 bytes, this can never result in a bigger tuple.

Strings exceeding the threshold of 127 bytes on disk (including 1 leading length-byte), flip to 4 leading length bytes and require the nominal integer alignment. That's where putting text before smallint can safe 8 bytes effectively. With random string length that happens in 25 % of cases, so 2 bytes on average for tuples of at least 144 bytes.

There's that. But putting the smallint first typically has tiny advantages, and most text columns stay well below the length threshold.

The thing to remember is not to intersperse multiple smallint and text columns. Multiple offsets can add up in this case.