Postgresql – Is there some sort of shortcut/feature in PostgreSQL to avoid horrible storage waste with duplicate values

postgresql

Let's say that I have this table which keeps track of every page load on my website:

CREATE TABLE "example.com page loads"
(
    id                      bigserial,
    "URL"                   text NOT NULL,
    "IP address"            inet NOT NULL,
    "user agent"            text,
    "timestamp"             timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY             (id)
)

If the same person loads 100 pages, or many other people with the same exact user-agent string load 10,000 pages, the same long "user agent" string will be stored redundantly 100/10,000 times in my poor table, massively inflating it.

This was always a huge problem for me back when I used plaintext webserver logs, and later when I did exactly what I'm describing now (a database table in PostgreSQL).

A very obvious and immediate thought that pops up in my head is: "Why can't the user agents be automatically stored just once, internally, and then automatically 'referenced' by PostgreSQL in whatever manner it is comfortable with, while never exposing this internal optimization to me?"

That is, I don't want to have to make a separate table like this:

CREATE TABLE "example.com unique user agents"
(
    id                      bigserial,
    "user agent"            text,
    PRIMARY KEY             (id),
    UNIQUE                  ("user agent")
)

… and then be forced to run expensive and annoying manual queries to check whether the user agent is already present in the table of unique user agents, and to reference that table from the "page loads" table through a "unique user agent id" column instead of a nice, simple text column.
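For concreteness, the manual bookkeeping I want to avoid would look roughly like this (this assumes the text column in "page loads" is replaced by a hypothetical "user agent id" bigint column; the `ON CONFLICT` upsert is just one way to do the lookup-or-insert step):

```sql
-- The normalized schema I don't want to maintain by hand:
CREATE TABLE "example.com unique user agents"
(
    id           bigserial PRIMARY KEY,
    "user agent" text UNIQUE
);

-- Every page-load insert now needs a lookup/insert step first.
-- The no-op DO UPDATE makes RETURNING yield the id even when
-- the user agent already exists:
WITH ua AS (
    INSERT INTO "example.com unique user agents" ("user agent")
    VALUES ('Mozilla/5.0 (X11; Linux x86_64) ...')
    ON CONFLICT ("user agent") DO UPDATE
        SET "user agent" = EXCLUDED."user agent"
    RETURNING id
)
INSERT INTO "example.com page loads" ("URL", "IP address", "user agent id")
SELECT '/index.html', '203.0.113.7', ua.id
FROM ua;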

I'm sure you understand exactly what I mean. Basically, it's such a common/obvious thing that I'm 99% sure that this must already have been solved a looong time ago, only I have just never realized it.

There is probably some simple feature to do exactly this, such as (this is just my guess):

CREATE TABLE "example.com page loads"
(
    id                      bigserial,
    "URL"                   text NOT NULL,
    "IP address"            inet NOT NULL,
    "user agent"            text OPTIMIZE_UNIQUELY_INTERNALLY,
    "timestamp"             timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY             (id)
)

That would be lovely, if there were such an "OPTIMIZE_UNIQUELY_INTERNALLY" flag that I could just apply to columns when I want this done "under the hood", without me having to think about it!

If there is such a thing, that would save me enormous amounts of storage and headaches.

I don't think that this is the same thing as indexes. Putting an index on the "user agent" column wouldn't make PG store each unique value only once, would it? It would only create an additional "look-up" structure for quicker queries.

Best Answer

There is no magical automation for that. You will have to create the lookup table yourself.

This is the way relational databases are designed: you spread your data over several tables. For example, if you normalize your schema, you will end up with more tables than entities. In a way, this lookup table can be seen as a kind of normalization, since the user agent doesn't feel atomic to you.

Don't worry about having more than one table: inner joins are quite simple and readable in SQL, and databases are optimized to process them efficiently.
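As a sketch of how readable such a join stays (assuming the page-loads table references the lookup table through a hypothetical "user agent id" column), you can even wrap it in a view so the rest of your code never sees the normalization, which is close to the "never exposing this internal optimization to me" behavior you asked for:

```sql
-- A view that reassembles the original flat shape of the data;
-- the column name "user agent id" is illustrative:
CREATE VIEW "example.com page loads flat" AS
SELECT pl.id,
       pl."URL",
       pl."IP address",
       ua."user agent",
       pl."timestamp"
FROM   "example.com page loads" AS pl
JOIN   "example.com unique user agents" AS ua
       ON ua.id = pl."user agent id";

-- Queries then look exactly like they did against the single wide table:
SELECT "user agent", count(*)
FROM   "example.com page loads flat"
GROUP  BY "user agent";
```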