PostgreSQL – Indexes: Integer vs String Performance

indexindex-tuningperformancepostgresqlquery-performance

I am developing an application in Ruby on Rails with the PostgreSQL (9.4) database. For my use case, columns in tables will be looked up very frequently, as the whole point of the application is searching for very specific attributes on a model.

I am currently deciding whether to use an integer type or simply use a typical string type (e.g. character varying(255), which is the default in Rails) for the columns, as I'm not sure what the performance difference will be on the index.

These columns are enums. They have a fixed size for the amount of possible values they can have. Most enum lengths do not exceed 5, meaning the index would be more or less fixed throughout the lifetime of the application; thus, the integer and string indexes would be identical in the number of nodes.

However, the string that would be indexed could be around 20 characters long, which in memory is roughly 5x that of the integer (if an integer is 4 bytes, and the strings are pure ASCII at 1 byte per character, then this holds). I don't know how database engines do index look-ups, but if it needs to "scan" the string until it matches exactly, then in essence that means that the string lookup would be 5x slower than an integer lookup; the "scan" until match for the integer lookup would be 4 bytes instead of 20. This is what I'm imagining:

The lookup value is (integer) 4:

scanning………………………. FOUND | getting records…
|BYTE_1|BYTE_2|BYTE_3|BYTE_4|BYTE_5|BYTE_6|BYTE_7|BYTE_8|…|

The lookup value is (string) "some_val" (8 bytes):

scanning………………………………………………………………………….
FOUND | getting records…
|BYTE_1|BYTE_2|BYTE_3|BYTE_4|BYTE_5|BYTE_6|BYTE_7|BYTE_8|…|

I hope that that makes sense. Basically, because the integer takes up less space, it can be "matched on" faster than its string counterpart. Perhaps this is a completely wrong guess, but I'm no expert, so that's why I'm asking you guys! I suppose that this answer I just found seems to support my hypothesis, but I want to be sure.

The number of possible values in the column wouldn't change in using either one, so the index itself would not change (unless I added a new value to the enum). In this case, would there be a performance difference in using integer or varchar(255), or does using an integer type make more sense?

The reason I am asking is that Rails' enum type maps integers to string keys, but they aren't meant to be user-facing columns. Essentially, you can't do verification that the enum value is a valid one, because an invalid value will cause an ArgumentError before any validations can be run. Using a string type would allow validations, but if there's a performance cost I'd rather just hack away around the validation problem.

Best Answer

Short answer: integer is faster than varchar or text in every aspect. Won't matter much for small tables and / or short keys. The difference grows with the length of the keys and the number of rows.

string ... 20 characters long, which in memory is roughly 5x that of the integer (if an integer is 4 bytes, and the strings are pure ASCII at 1 byte per character, then this holds)

To be precise, character types (text or varchar) occupy exactly 21 bytes for 20 ASCII characters on disk and 23 bytes in RAM. Detailed assessment:

What is the overhead for varchar(n)?

Also important: COLLATION rules can make sorting character data more expensive - unlike numeric data types:

Would index lookup be noticeably faster with char vs varchar when all values are 36 chars

Index size is probably responsible for the lion share of performance difference in most cases. Consider the overhead per index tuple (basically the same as for a table): 4 bytes for the item identifier and 8 bytes for the index tuple header. So the index tuple for integer would amount to 20 bytes (including 4 bytes of alignment padding) and for varchar(20) with 20 ASCII characters it would be 36 bytes (also incl. padding). Details:

Configuring PostgreSQL for read performance

All the theory aside: it's best to just test:

Measure the size of a PostgreSQL table row

Postgres 9.5 introduced an optimization for sorting long strings of character data (key word "abbreviated keys"). But a bug in some C library functions on Linux forced the project to disable the feature for non-C collations in Postgres 9.5.2. Details in the release notes.

However, if you actually use Postgres enum types, most of these considerations are irrelevant, since those are implemented with integer values internally anyway. The manual:

An enum value occupies four bytes on disk.

Aside: varchar(255) used to make sense for early versions of SQL Server, which could use a more efficient data type internally up to the limit of 255 characters. But the odd length restriction of 255 characters has no special meaning in Postgres at all.

Best Answer

Related Solutions

Postgresql – Benefit to keeping a record in the database, rather than deleting it, for performance issues

MySQL – Performance Impact of Maximum TEXT/BLOB Columns per Table

Related Question