Postgresql – Is there some sort of shortcut/feature in PostgreSQL to avoid horrible storage waste with duplicate values

postgresql

Let's say that I have this table which keeps track of every page load on my website:

CREATE TABLE "example.com page loads"
(
    id                      bigserial,
    "URL"                   text NOT NULL,
    "IP address"            inet NOT NULL,
    "user agent"            text,
    "timestamp"             timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY             (id)
)

If the same person loads 100 pages, or many other people with the same exact user-agent string load 10,000 pages, the same long "user agent" string will be stored redundantly 100/10,000 times in my poor table, massively inflating it.

This was always a huge problem for me back when I used plaintext webserver logs, and later when I did exactly what I'm describing now (a database table in PostgreSQL).

A very obvious and immediate thought that pops up in my head is: "Why can't the user agents be automatically stored just once, internally, and then automatically 'referenced' by PostgreSQL in whatever manner it is comfortable with, while never exposing this internal optimization to me?"

That is, I don't want to have to make a separate table like this:

CREATE TABLE "example.com unique user agents"
(
    id                      bigserial,
    "user agent"            text,
    PRIMARY KEY             (id),
    UNIQUE                  ("user agent")
)

… and then be forced to run expensive and annoying manual queries to check whether the user agent is already present in the table of unique user agents, and to reference that table from the "page loads" table through a "unique user agent id" column instead of a nice, simple text column.
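For concreteness, the manual bookkeeping I want to avoid would look roughly like this (this assumes the text column in "page loads" is replaced by a hypothetical "user agent id" bigint column; the `ON CONFLICT` upsert is just one way to do the lookup-or-insert step):

```sql
-- The normalized schema I don't want to maintain by hand:
CREATE TABLE "example.com unique user agents"
(
    id           bigserial PRIMARY KEY,
    "user agent" text UNIQUE
);

-- Every page-load insert now needs a lookup/insert step first.
-- The no-op DO UPDATE makes RETURNING yield the id even when
-- the user agent already exists:
WITH ua AS (
    INSERT INTO "example.com unique user agents" ("user agent")
    VALUES ('Mozilla/5.0 (X11; Linux x86_64) ...')
    ON CONFLICT ("user agent") DO UPDATE
        SET "user agent" = EXCLUDED."user agent"
    RETURNING id
)
INSERT INTO "example.com page loads" ("URL", "IP address", "user agent id")
SELECT '/index.html', '203.0.113.7', ua.id
FROM ua;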

I'm sure you understand exactly what I mean. Basically, it's such a common/obvious thing that I'm 99% sure that this must already have been solved a looong time ago, only I have just never realized it.

There is probably some simple feature to do exactly this, such as (this is just my guess):

CREATE TABLE "example.com page loads"
(
    id                      bigserial,
    "URL"                   text NOT NULL,
    "IP address"            inet NOT NULL,
    "user agent"            text OPTIMIZE_UNIQUELY_INTERNALLY,
    "timestamp"             timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY             (id)
)

That would be lovely, if there were such an "OPTIMIZE_UNIQUELY_INTERNALLY" flag that I could just apply to columns when I want this done "under the hood", without me having to think about it!

If there is such a thing, that would save me enormous amounts of storage and headaches.

I don't think that this is the same thing as indexes. Putting an index on the "user agent" column wouldn't make PG store each unique value only once, would it? It would only create an additional "look-up" structure for quicker queries.

Best Answer

There is no magical automation for that. You will have to create the lookup table yourself.

This is the way relational databases are designed: you spread your data over several tables. For example, if you normalize your schema, you will end up with more tables than entities. In a way, this lookup table can be seen as a kind of normalization, since the user agent doesn't feel atomic to you.

Don't worry about having more than one table: inner joins are quite simple and readable in SQL, and databases are optimized to process them efficiently.
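As a sketch of how readable such a join stays (assuming the page-loads table references the lookup table through a hypothetical "user agent id" column), you can even wrap it in a view so the rest of your code never sees the normalization, which is close to the "never exposing this internal optimization to me" behavior you asked for:

```sql
-- A view that reassembles the original flat shape of the data;
-- the column name "user agent id" is illustrative:
CREATE VIEW "example.com page loads flat" AS
SELECT pl.id,
       pl."URL",
       pl."IP address",
       ua."user agent",
       pl."timestamp"
FROM   "example.com page loads" AS pl
JOIN   "example.com unique user agents" AS ua
       ON ua.id = pl."user agent id";

-- Queries then look exactly like they did against the single wide table:
SELECT "user agent", count(*)
FROM   "example.com page loads flat"
GROUP  BY "user agent";
```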