PostgreSQL – optimizing or estimating PostgreSQL table size on disk

disk-space, postgresql

This question is in the context of the helpcovid GPLv3+ project (work in progress in March 2020). With PostgreSQL 11 on Debian/Buster/x86-64. See the file hcv_database.cc there.

I have a table of web users:

---- TABLE tb_user
CREATE TABLE IF NOT EXISTS tb_user (
  user_id SERIAL PRIMARY KEY NOT NULL,  -- unique user_id
  user_firstname VARCHAR(31) NOT NULL,  -- first name, in capitals, UTF8
  user_familyname VARCHAR(62) NOT NULL, -- family name, in capitals, UTF8
  user_email VARCHAR(71) NOT NULL,      -- email, in lowercase, UTF8
  user_gender CHAR(1) NOT NULL,         -- 'F' | 'M' | '?'
  user_crtime DATE NOT NULL             -- user entry creation time
); --- end TABLE tb_user

Imagine I have a million rows in that table, where 99% of users have a user_firstname of less than 16 bytes and 99% of users have a user_familyname of less than 32 bytes.

Assume that this table is only filled by INSERT SQL statements.

What would be the estimated disk space consumption of the PostgreSQL files (under /var/lib/postgresql) containing that database, as measured by du -h?

For simplicity, assume that this table is the only one in the database.

To put the question another way: is the full VARCHAR(31) size used for every row on disk, or not?

In other words, how much does the declared width of columns practically matter with respect to disk space?

Related question: How does Postgres store many varchars on disk?

Best Answer

The best way is to run a little benchmark:

INSERT INTO tb_user
   (user_firstname, user_familyname, user_email, user_gender, user_crtime)
SELECT substr(md5(random()::text), 1, 15),
       substr(md5(random()::text), 1, 15),
       md5(random()::text),
       'F',
       current_date
FROM generate_series(1, 1000000);

SELECT pg_size_pretty(pg_total_relation_size('tb_user'));

 pg_size_pretty 
----------------
 126 MB
(1 row)

A text or varchar value that is short enough (shorter than 127 bytes, which the ones in this table certainly are) is stored with a 1-byte header, so it consumes one byte more than the number of bytes in the string — regardless of the declared maximum length.

There is some overhead caused by block and tuple headers and other metadata, as well as by fragmentation (some empty space remains in each 8kB block). And then the index that supports the primary key is also sizeable:

\di+ tb_user_pkey 
                            List of relations
 Schema  |     Name     | Type  |  Owner  |  Table  | Size  | Description 
---------+--------------+-------+---------+---------+-------+-------------
 laurenz | tb_user_pkey | index | laurenz | tb_user | 21 MB | 
(1 row)
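These numbers can be reconciled with a back-of-the-envelope estimate. The sketch below (in Python, just for the arithmetic) assumes the usual PostgreSQL storage layout on x86-64: 8kB pages with a 24-byte page header, a 23-byte tuple header rounded up to 8-byte alignment (MAXALIGN), a 4-byte line pointer per row, and the 1-byte varlena headers described above. These are standard defaults, but worth verifying for your build:

```python
# Back-of-the-envelope heap size estimate for the 1,000,000-row benchmark.
# Assumptions: 8 kB pages, 24-byte page header, 23-byte tuple header
# rounded up to MAXALIGN (8), a 4-byte line pointer per row, and
# 1-byte varlena headers for the short strings (PostgreSQL on x86-64).

def align(n, a):
    """Round n up to the next multiple of a."""
    return -(-n // a) * a

TUPLE_HEADER = align(23, 8)   # 24 bytes after alignment
LINE_POINTER = 4              # per-row entry in the page's item array
PAGE_USABLE = 8192 - 24       # page size minus the page header

# Column widths for the benchmark rows (1-byte varlena header + data):
data = (
    4          # user_id: int4
    + 1 + 15   # user_firstname: 15-byte string
    + 1 + 15   # user_familyname: 15-byte string
    + 1 + 32   # user_email: md5 hex, 32 bytes
    + 1 + 1    # user_gender: char(1), also stored as varlena
)
data = align(data, 4) + 4     # user_crtime: date needs 4-byte alignment

row = align(TUPLE_HEADER + data, 8) + LINE_POINTER
rows_per_page = PAGE_USABLE // row
pages = -(-1_000_000 // rows_per_page)   # ceiling division
heap_mb = pages * 8192 / 1024 ** 2
print(f"{row} bytes/row, {rows_per_page} rows/page, heap ≈ {heap_mb:.0f} MB")
# → 108 bytes/row, 75 rows/page, heap ≈ 104 MB
```

Roughly 104 MB of heap plus the 21 MB primary-key index lands close to the 126 MB reported by pg_total_relation_size, which confirms that VARCHAR(31) does not reserve 31 bytes per row.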

Some additional hints:

  • You can save a little space by having user_crtime in the second position. Otherwise, you might lose up to three bytes to padding.

  • Use bigint for the primary key so that you can never run out of values. Trying to save on these 4 bytes is usually not a good idea.