Best practices for integer key types – mixed or all the same integer type

database-designdatatypesprimary-key

I am designing a postgres database for scientific data. It has about 30-40 tables, many of which are lookup tables with less than 10 records. The bulk of the data are concentrated in just a few tables with ~ 1,000,000 records that reference the lookup tables with foreign keys. There will be web user interface that accesses the data via a GraphQL API.

I always use the same INT data type for all primary and foreign key types. Is there merit in optimizing the primary keys on the lookup tables as SMALLINT to conserve space on the big tables (and perhaps a modest overall performance)? Or is it best to just keep all key columns the same data type to reduce complexity and confusion for humans and software accessing the data?

(The lookup data could change over time so I can't use ENUM.)

Best Answer

It's a little subjective, and there's certainly nothing wrong with reduced complexity by defaulting to INT (since it covers most use cases and there's almost no tangible difference in performance) but generally if your tables are going to be so small in record count, using SMALLINT is a good idea from a best practice standpoint. SMALLINT gets you +-32767 rows (for a total of 65,534). So it is definitely sufficient for your small tables.

When deciding on the size for a datatype, generally planning for anywhere between twice as much up to a magnitude more of the max amount values you think is possible is a decent rule of thumb, depending on how confident you are in that rough estimation.

At the end of the day, if 25% of your tables have 1,000,000 rows in them, switching one of the smaller tables to SMALLINT instead of INT will only save you about 20 MB of space overall (which is peanuts). If you did it for all your small tables, then at most you're saving about half a gigabyte of space. There's probably better optimizations to go after currently and you can let sleeping dogs lie. Moving forward if you want to take SMALLINT into better consideration on new tables, it's not a bad idea.

Related Solutions

Mysql – Character vs Integer primary keys

It depends on your engine. Common wisdom is that reads are cheap, a few bytes here and there will not significantly impact the performance of a small to medium size database.

More importantly, it depends on the uses to which you will put the primary key. Integer serials have the advantage of being simple to use and implement. They also, depending on the specific implementation of the serialization method, have the advantage of being quickly derivable, as most databases just store the serial number in a fixed location, rather than deriving it with Select max(ID)+1 from foo on the fly.

The question becomes: how does a 5 character key present a "meaningful value" to you and to the application? How is this value created, and does it take more or less time than finding an incrementing serial number. While there is a trivial amount of space saved in some integers, the vast majority of systems will ignore this space savings.

There are no performance implications, save that the character scheme requires that there never be an automatic engine, as your "keys" are underivable. For your specific domain, don't bother with artificial keys, and just use Chinese, Japanese and Thai as key names. While you cannot guarantee uniqueness over any possible application, in your scope it is much more reasonable to use them instead of horrible and forced 5-character abbreviations. There are no significant performance impacts until you get to the millions of tuples.

Alternatively, if you're just tracking by country of origin, and not specific regional cuisines (Cantonese, Sichuan, Sicilian, Umbrian, Calabrian, Yucatecan, Oaxacan, etc.), you could always just use ISO 3166 codes.

If I have 10,000 recipes doesn't the difference between a 5-character and 20-character key start to add up?

Space is cheap. When you're talking 10,000,000 recipes that you're doing OLAP operations on, then, maybe. With 10k recipes, you're looking at 150k of space.

But again, it depends. If you have many millions of records, and are doing joins on them, then it makes sense to denormalize the lookup for something this trivial (into a materialized view). For all practical purposes, the relative join efficiency on a modern machine between a 5 character key and variable length key is so similar to be identical. Happily, we live in a world of plentiful CPU and plentiful disk. The nasty ones are too many joins and query inefficiency, rather than character-by-character comparison. With that said, always test.

P&T things of this level are so database-dependent that generalizations are extremely difficult. Build two sample models of the database, populate them with the estimated numbers of records, then see which one is faster. In my experience, character length doesn't make a huge difference compared with good indexes, good memory configurations, and other critical performance tuning elements.

Postgresql – What happens to the index of a primary key after a DROP CONSTRAINT

Disclaimer

This is experimental and only tested rudimentarily. Proceed at your own risk. I would not use it myself and just drop / recreate constraints with standard DDL commands. If you break entries in the catalog tables you could easily mess up your database.

For all I know, there are only two differences between a PRIMARY KEY and a UNIQUE constraint in the catalog tables (the index itself is identical):

pg_index.indisprimary:
For PRIMARY KEY constraint ... TRUE
For UNIQUE constraint ... FALSE

pg_constraint.contype:
PRIMARY KEY constraint ... 'p'
UNIQUE constraint ... 'u'

You could convert constraint and index in place, from PRIMARY KEY constraint to UNIQUE constraint, my_idx being the (optionally schema-qualified) index name:

UPDATE pg_index SET indisprimary = FALSE WHERE indexrelid = 'my_idx'::regclass
UPDATE pg_constraint SET contype = 'u' WHERE conindid = 'my_idx'::regclass;

Or upgrade from UNIQUE to PRIMARY KEY:

UPDATE pg_index SET indisprimary = TRUE WHERE indexrelid = 'my_idx'::regclass;
UPDATE pg_constraint SET contype = 'p' WHERE conindid = 'my_idx'::regclass;

Best Answer

Related Solutions

Mysql – Character vs Integer primary keys

Postgresql – What happens to the index of a primary key after a DROP CONSTRAINT

Disclaimer

Related Question