Postgresql – Shall I use enum when are too many “categories” with PostgreSQL

importperformancepostgresqlpostgresql-performance

Problem

I plan to load a CSV with more 10 million records into PostgreSQL v12.1, one of its columns has "categorical" values, so creating an enumerated type for it seems to be a good choice, but it holds 208 categories.

The shortest field is 2 and longest is 11 character long. The Average of all fields is 2.4. The character encoding is UTF8, but all characters are ASCII.

Questions:

Which type should I use enumerated or varchar?

Additional info

I discard char because the official PostgreSQL documentation states the followig about char, varchar and text:

Tip: There is no performance difference among these three types, apart
from increased storage space when using the blank-padded type, and a
few extra CPU cycles to check the length when storing into a
length-constrained column. While character(n) has performance
advantages in some other database systems, there is no such advantage
in PostgreSQL; in fact character(n) is usually the slowest of the
three because of its additional storage costs. In most situations text
or character varying should be used instead.

An enum value in PostgreSQL occupies 4 bytes on the disk (see 8.7.4. Implementation Details). Considering this and 2.4 average string length using the enum type would lead a slightly higher disk usage (Short stings in PostgreSQL needs one extra byte disk space). Still I have the intiution that using enum is a better choice, because its implementation makes many operations faster against it.

Best Answer

With an average of 2.4 characters (more relevant: avg bytes - but that's the same for all ASCII characters) I would not bother to use enums. Those occupy 4 bytes on disk plus, possibly, alignment padding. (text does not require alignment padding.) You are not even saving storage and get more overhead for it.

With most values below 7 characters (= 8 bytes on disk), an index on a text category column will also be only slightly bigger than one on an enum. (Space for data is (typically) allocated in multiples of 8 bytes.)

For a fixed number of 208 categories, a "char" encoding (not to be confused with char!) might be an option to save storage. See:

But, again, not worth the trouble for such small strings. Just use text. Maybe enforce correctness with a FK constraint to a category table like:

CREATE TABLE category (category text PRIMARY KEY);

Also a good place to store additional information per category. And you can easily modify the set of categories. Make the FK constraint ON UPDATE CASCADE and you can change category names in one central place. Make it ON DELETE SET NULL , and you can easily remove a category. Etc.

Related: