PostgreSQL – Estimating the size of records and the per-row overhead in Postgres

Tags: cardinality-estimates, postgresql, postgresql-9.4, size

Consider the following table in Postgres 9.4:

CREATE TABLE t
(
  a1 bigserial,
  a2 bigint NOT NULL,
  a3 bigint NOT NULL,
  a4 integer, 
  a5 timestamp with time zone NOT NULL,
  a6 timestamp with time zone NOT NULL DEFAULT now(),
  a7 bigint NOT NULL,
  a8 bigint NOT NULL,
  a9 real,
  a10 integer,

  CONSTRAINT kkkey PRIMARY KEY (a1)
);

What is the estimated cost of storing records in this table?

A record costs:

size(bigserial) 
+ size(bigint) 
+ size(bigint) 
+ size(integer) 
+ size(timestamp) 
+ size(timestamp) 
+ size(bigint) 
+ size(bigint) 
+ size(real) 
+ size(integer)
= 8 + 8 + 8 + 4 + 8 + 8 + 8 + 8 + 4 + 4 = 68 bytes
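These per-column figures omit the per-row overhead: each tuple carries a 23-byte header (padded to 24 by MAXALIGN on 64-bit platforms), plus a 4-byte line pointer in the page, plus alignment padding between columns (a `bigint` or `timestamptz` must start on an 8-byte boundary, so with this column order there are 4 padding bytes between `a4` and `a5`). A quick way to cross-check the real stored size, assuming the table above exists and holds at least one row, is to apply `pg_column_size` to a whole-row value:

```sql
-- Actual stored size of one tuple, including the 24-byte tuple header
-- and alignment padding (requires at least one row in t):
SELECT pg_column_size(t.*) AS tuple_bytes FROM t LIMIT 1;
```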

The Database Page Layout chapter of the PostgreSQL documentation gives rather detailed information about how records land on secondary storage, but I am not sure how to put all the numbers together.

Linux reports:

blockdev --getbsz /dev/sda1
1024
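Note that `blockdev` reports the device's block size, which is unrelated to the page size PostgreSQL uses internally; that is fixed at compile time (`BLCKSZ`, 8 kB by default) and can be checked from within the database:

```sql
SHOW block_size;  -- typically 8192, independent of the filesystem block size
```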

Questions:

(1) Are there any helper functions to assess the storage cost per row (so one does not need to do these complicated computations by hand)?

(2) How does one put the numbers together, i.e., estimate the overhead cost for each row?

(3) How to estimate the costs for the primary key index?

Best Answer

Functions that give the size of columns, tables, and indexes are documented in the manual: http://www.postgresql.org/docs/9.4/static/functions-admin.html
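For example, using the functions from that page (sizes will of course vary with your data; the heap-vs-index split also answers question 3):

```sql
SELECT pg_column_size(now());                        -- size of a single value
SELECT pg_size_pretty(pg_relation_size('t'));        -- heap (main fork) only
SELECT pg_size_pretty(pg_indexes_size('t'));         -- all indexes on t, incl. the PK
SELECT pg_size_pretty(pg_total_relation_size('t'));  -- heap + indexes + TOAST
```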

There is no function to calculate the size of an entire record, although there is one for the storage cost of an individual data value (pg_column_size), because records are in general of variable length and are sometimes compressed. So I think you have two possibilities: either run a catalog query that sums up the sizes of the table's columns, or simply take the size of a populated table and divide it by the number of records to get an average size per record.
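The second approach can be written as a single query, assuming `t` is populated (the `NULLIF` guards against division by zero on an empty table; the result also folds in page headers and free space, which is arguably what you want for a storage estimate):

```sql
-- Average stored bytes per row in the heap:
SELECT pg_relation_size('t')::numeric
       / NULLIF(count(*), 0) AS avg_bytes_per_row
FROM t;
```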