PostgreSQL – optimizing or estimating PostgreSQL table size on disk

disk-space, postgresql

This question is in the context of the helpcovid GPLv3+ project (work in progress in March 2020). With PostgreSQL 11 on Debian/Buster/x86-64. See the file hcv_database.cc there.

I have a table of web users:

---- TABLE tb_user
CREATE TABLE IF NOT EXISTS tb_user (
  user_id SERIAL PRIMARY KEY NOT NULL,  -- unique user_id
  user_firstname VARCHAR(31) NOT NULL,  -- first name, in capitals, UTF8
  user_familyname VARCHAR(62) NOT NULL, -- family name, in capitals, UTF8
  user_email VARCHAR(71) NOT NULL,      -- email, in lowercase, UTF8
  user_gender CHAR(1) NOT NULL,         -- 'F' | 'M' | '?'
  user_crtime DATE NOT NULL             -- user entry creation time
); --- end TABLE tb_user

Imagine I have a million rows in that table, where 99% of users have a user_firstname of less than 16 bytes and 99% of users have a user_familyname of less than 32 bytes.

Assume that this table is only filled by INSERT SQL statements.

What would be the estimated disk space consumption of the PostgreSQL files (under /var/lib/postgresql) containing that database, as measured by du -h?

For simplicity, assume that this table is the only one in the database.

To put the question another way: is the full VARCHAR(31) size used for every row on disk, or not?

In other words, how much does the declared width of columns practically matter with respect to disk space?

Related question: How does Postgres store many varchars on disk?

Best Answer

The best way is to run a little benchmark:

INSERT INTO tb_user
   (user_firstname, user_familyname, user_email, user_gender, user_crtime)
SELECT substr(md5(random()::text), 1, 15),
       substr(md5(random()::text), 1, 15),
       md5(random()::text),
       'F',
       current_date
FROM generate_series(1, 1000000);

SELECT pg_size_pretty(pg_total_relation_size('tb_user'));

 pg_size_pretty 
----------------
 126 MB
(1 row)

A text or varchar value that is short enough (shorter than 127 bytes, which the ones in this table certainly are) is stored with a 1-byte header, so it consumes one byte more than the number of bytes in the string — regardless of the declared maximum length.

There is some overhead caused by block and tuple headers and other metadata, as well as by fragmentation (some empty space remains in each 8kB block). And then the index that supports the primary key is also sizeable:

\di+ tb_user_pkey 
                            List of relations
 Schema  |     Name     | Type  |  Owner  |  Table  | Size  | Description 
---------+--------------+-------+---------+---------+-------+-------------
 laurenz | tb_user_pkey | index | laurenz | tb_user | 21 MB | 
(1 row)
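These numbers can be reconciled with a back-of-the-envelope estimate. The sketch below (in Python, just for the arithmetic) assumes the usual PostgreSQL storage layout on x86-64: 8kB pages with a 24-byte page header, a 23-byte tuple header rounded up to 8-byte alignment (MAXALIGN), a 4-byte line pointer per row, and the 1-byte varlena headers described above. These are standard defaults, but worth verifying for your build:

```python
# Back-of-the-envelope heap size estimate for the 1,000,000-row benchmark.
# Assumptions: 8 kB pages, 24-byte page header, 23-byte tuple header
# rounded up to MAXALIGN (8), a 4-byte line pointer per row, and
# 1-byte varlena headers for the short strings (PostgreSQL on x86-64).

def align(n, a):
    """Round n up to the next multiple of a."""
    return -(-n // a) * a

TUPLE_HEADER = align(23, 8)   # 24 bytes after alignment
LINE_POINTER = 4              # per-row entry in the page's item array
PAGE_USABLE = 8192 - 24       # page size minus the page header

# Column widths for the benchmark rows (1-byte varlena header + data):
data = (
    4          # user_id: int4
    + 1 + 15   # user_firstname: 15-byte string
    + 1 + 15   # user_familyname: 15-byte string
    + 1 + 32   # user_email: md5 hex, 32 bytes
    + 1 + 1    # user_gender: char(1), also stored as varlena
)
data = align(data, 4) + 4     # user_crtime: date needs 4-byte alignment

row = align(TUPLE_HEADER + data, 8) + LINE_POINTER
rows_per_page = PAGE_USABLE // row
pages = -(-1_000_000 // rows_per_page)   # ceiling division
heap_mb = pages * 8192 / 1024 ** 2
print(f"{row} bytes/row, {rows_per_page} rows/page, heap ≈ {heap_mb:.0f} MB")
# → 108 bytes/row, 75 rows/page, heap ≈ 104 MB
```

Roughly 104 MB of heap plus the 21 MB primary-key index lands close to the 126 MB reported by pg_total_relation_size, which confirms that VARCHAR(31) does not reserve 31 bytes per row.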

Some additional hints:

  • You can save a little space by having user_crtime in the second position. Otherwise, you might lose up to three bytes to padding.

  • Use bigint for the primary key so that you can never run out of values. Trying to save on these 4 bytes is usually not a good idea.