Postgresql – the reason to “normalize your databases”

Tags: normalization, postgresql

In the past, I've been told (on this site) that I should normalize the values in the database – using a lookup table instead of using direct (string) keys.

I am confused about why this is considered so beneficial that several people recommended it. Is it just about memory consumption? And if so, in my case (explained below), how much would that actually save?

Consider I have a dictionary database for a website:

CREATE TABLE dictionary
(
    id serial NOT NULL,
    key text NOT NULL,
    language text NOT NULL,
    value text,
    PRIMARY KEY (id)
)

And then insertion would happen like:

INSERT INTO public.dictionary (key, language, value) VALUES
    ('yes_button', 'en', 'yes'), ('yes_button', 'nl', 'ja');

Or instead of 'en' I'd use 'en-us'. Now I've been told to "normalize" the database, which would mean having a lookup table that maps the string representation of a language ('en', 'nl') to an integer id:

CREATE TABLE languages
(
    id serial NOT NULL,
    language text,
    PRIMARY KEY (id)
)

CREATE TABLE dictionary
(
    id serial NOT NULL,
    key text NOT NULL,
    language integer NOT NULL,
    value text,
    PRIMARY KEY (id),
    FOREIGN KEY (language)
        REFERENCES public.languages (id)
)

However, this would increase complexity quite a bit, since insertion can no longer be simple: it either needs to check the lookup table on the backend, or use some more complex SQL. So there is a real cost to moving to this design.
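
For a rough idea of that "more complex SQL", here is a minimal sketch (my own, assuming the two normalized tables above) where the language id is resolved from the lookup table as part of the insert:

-- Resolve the language id from the lookup table while inserting.
-- Nothing is inserted if the language row does not exist yet.
INSERT INTO public.dictionary (key, language, value)
SELECT 'yes_button', l.id, 'yes'
FROM public.languages AS l
WHERE l.language = 'en';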

What are the advantages?

  • Is it just the storage size? An integer foreign key reference is 4 bytes anyway, while a 2-character string is 3 bytes and a 5-character string is 6 bytes, so the integer saves 2 bytes at most.

  • Is it database speed? But isn't that then a micro-optimization, the kind that is the "root of all evil"?

  • Is it just to make sure that each language "exists" before insertion? There are other mechanisms for that (see the sketch after this list), and languages should be created on the fly anyway. Existence isn't determined by what is in our database but by an external lookup against the standards, and a language is only really added once a user provides a first translation.
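
As one example of such a mechanism (a sketch of my own, not from the original discussion), a CHECK constraint on the original text column can validate the shape of a locale string without any lookup table; the constraint name and pattern here are only illustrative:

-- Reject language values that don't look like 'xx' or 'xx-yy'.
ALTER TABLE dictionary
    ADD CONSTRAINT language_format_chk
    CHECK (language ~ '^[a-z]{2}(-[a-z]{2})?$');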

Examples of the idea were mentioned in the comments on my previous question Improving database design, is this a valid case for entity attribute values?:

language should be an int and refer to a separate table (normalizing your tables) – nbk Jan 13 at 16:54

@nbk interesting -> why? The ISO country + language codes are already unique, so a string can just as well be the key, right? Why would I use an extra integer indirection when the textual description is as strong a guarantee of uniqueness? – paul23 Jan 13 at 17:58

@paul23 a normalized table like in my description or in the answer helps to reduce the size and speed up the queries. Say you have 255 languages and you store an int instead: that reduces the size massively. A normalized data design also helps you avoid redundant data in your tables – nbk Jan 13 at 18:20


The dictionary database is actually separate from the data database, since it is used by several applications.

Even with a normalizing lookup table, the strings 'en-us', 'nl-nl', … might very well be perfect keys themselves. They are unique on their own, so why would I use a uuid or some other surrogate? In other words, 'en-us' isn't just an arbitrary string: it is a locale tag built from standardized codes (an ISO 639 language code plus an ISO 3166 country code, as combined in IETF language tags). The set of possible strings is therefore very limited, and each one serves as an "identifier" the way "pi" does for 3.1415…

Regarding data quality, I wouldn't have to use an integer key; I could use a string, and 'en-us' is then by definition the correct string to represent the "English as spoken in the US" locale.
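
To make that concrete, this is roughly the design I have in mind (my own sketch: still a lookup table for referential integrity, but keyed by the locale string itself rather than by a serial id):

CREATE TABLE languages
(
    language text NOT NULL,   -- the locale tag itself, e.g. 'en-us'
    PRIMARY KEY (language)
);

CREATE TABLE dictionary
(
    id serial NOT NULL,
    key text NOT NULL,
    language text NOT NULL,
    value text,
    PRIMARY KEY (id),
    FOREIGN KEY (language)
        REFERENCES public.languages (language)
);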

Best Answer

Your issue is that you are getting two different pieces of advice conflated into one and the justifications for each piece of advice are not being presented clearly.

Recommendation 1: Normalize your database

For any transactional database this is generally considered a best practice. There are lots of reasons why you might back away from it, and there are applications, like BI data warehouses, where it is not necessarily what you want. However, for a transactional database the rule of thumb is to normalize to begin with and denormalize only when necessary.

Where there seems to be some confusion is around why to normalize. You are not alone in this confusion. A lot of people have a lot of misconceptions about the purpose of normalizing your database.

Normalization is NOT (primarily) about:

  • Increasing performance
  • Saving memory
  • Saving disk space
  • Reducing duplication (it is about reducing redundancy, but that is not exactly the same thing as duplication - more below)

Normalization IS about:

  • Data Quality (first and foremost)

Normalizing your database design means that you don't store a piece of data in multiple places, so that when you need to update or delete it you don't get into trouble because the data has to change in more than one location. When you don't normalize, you end up with the very real possibility of your data becoming inconsistent over time as changes are applied inconsistently. This is a bit of an oversimplification, because there are other benefits of normalization, such as simplified query logic. However, data quality is by far the most important benefit and, ironically, it is the one which programmers without formal database design training most often fail to understand clearly.
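
A small sketch of how that plays out with the non-normalized dictionary table from the question (language stored as plain text in every row): a rename has to touch every affected row, and any row a partial update misses silently keeps the old value.

-- Rename 'en' to 'en-us'; the extra (hypothetical) filter makes the update partial.
UPDATE dictionary
SET language = 'en-us'
WHERE language = 'en'
  AND key LIKE 'yes%';

-- Afterwards the same logical language shows up under two different codes:
SELECT language, count(*)
FROM dictionary
GROUP BY language;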

Recommendation 2: Use Integer Surrogate Keys

This falls much more under the rubric of common custom than anything else. Lots of people like to use a meaningless integer primary key in all their relational database tables. This in itself is actually two pieces of advice: (i) use surrogate keys in every table and (ii) use integers for surrogate keys.

Different people will give you different reasons for why they consider these best practices and all of these reasons can be argued over on a case by case basis. The best argument I can think of for using surrogate keys is that natural keys are more likely to be changed. Changing any primary key is a giant pain so it's best to avoid if you can. The best argument I can think of for using integers for your surrogate key is that it's a nice simple data type which is compact and efficient. Again, this is highly situational so people will make an argument for or against this in different cases.
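
As a small sketch of that last point (assuming the normalized, integer-keyed tables from the question): with a surrogate key, renaming a language touches exactly one row, and every dictionary row that references languages.id follows along automatically.

-- Change the visible language code once, in one place.
UPDATE languages
SET language = 'en-us'
WHERE language = 'en';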

What I would say about Recommendation 2 overall is this: pick a lane and stick to it so that your code stays relatively consistent, and diverge from it only when you have a really compelling case of critical performance or efficiency and you can clearly demonstrate that diverging from your usual approach has significant benefits.


In general, I avoid using natural keys, but there are times when you can get away with it, and even times when it makes better sense to use them. The question you need to ask yourself with a natural key is "Might this change in the future?"

My rule of thumb is "if a user can see it, they're going to want to change it someday". In your specific case though, language codes are set by an international standards body, so the chance that they might change is pretty slim; it would be too big of a pain. I wouldn't hesitate in your case to use "en-us" as a key value.