Problem
I plan to load a CSV with more 10 million records into PostgreSQL v12.1, one of its columns has "categorical" values, so creating an enumerated type for it seems to be a good choice, but it holds 208 categories.
The shortest field is 2 and longest is 11 character long. The Average of all fields is 2.4. The character encoding is UTF8, but all characters are ASCII.
Questions:
Which type should I use enumerated or varchar?
Additional info
I discard char because the official PostgreSQL documentation states the followig about char, varchar and text:
Tip: There is no performance difference among these three types, apart
from increased storage space when using the blank-padded type, and a
few extra CPU cycles to check the length when storing into a
length-constrained column. While character(n) has performance
advantages in some other database systems, there is no such advantage
in PostgreSQL; in fact character(n) is usually the slowest of the
three because of its additional storage costs. In most situations text
or character varying should be used instead.
An enum value in PostgreSQL occupies 4 bytes on the disk (see 8.7.4. Implementation Details). Considering this and 2.4 average string length using the enum type would lead a slightly higher disk usage (Short stings in PostgreSQL needs one extra byte disk space). Still I have the intiution that using enum is a better choice, because its implementation makes many operations faster against it.
Best Answer
With an average of 2.4 characters (more relevant: avg bytes - but that's the same for all ASCII characters) I would not bother to use enums. Those occupy 4 bytes on disk plus, possibly, alignment padding. (
text
does not require alignment padding.) You are not even saving storage and get more overhead for it.With most values below 7 characters (= 8 bytes on disk), an index on a
text
category column will also be only slightly bigger than one on anenum
. (Space for data is (typically) allocated in multiples of 8 bytes.)For a fixed number of 208 categories, a
"char"
encoding (not to be confused withchar
!) might be an option to save storage. See:But, again, not worth the trouble for such small strings. Just use
text
. Maybe enforce correctness with a FK constraint to acategory
table like:Also a good place to store additional information per category. And you can easily modify the set of categories. Make the FK constraint
ON UPDATE CASCADE
and you can change category names in one central place. Make itON DELETE SET NULL
, and you can easily remove a category. Etc.Related: