Postgresql – Proper storage size estimation technique for PostgreSQL

Tags: database-size, postgresql, scalability

We are preparing a PostgreSQL database for production use and need to estimate its storage size. We are a team of developers with little database administration expertise, so we are doing research, reading the manuals, and using our general IT knowledge to get there.

We have actual data to migrate to this database and some rough growth estimates. For the sake of the example, let's say we estimate 50% growth per year.

The point is: what is the proper general technique for making a good size estimate?

We are estimating storage usage with the following rules. Topics where we need advice are marked in bold. Feedback on the whole process is more than welcome:

  1. Estimate the size of each table
    1. Discover the actual size of each row.
      • For fields with a fixed size (like bigint, char, etc) we used the sizes described in the documentation
      • For fields with a dynamic size (like text) we estimated the string length and used the function `SELECT pg_column_size('expected text here'::text)`
      • We added 4 more bytes for the OID that PostgreSQL uses internally
    2. Multiply the size of each row by the number of estimated rows
    3. Do I need to consider any overhead here, like row or table metadata?
  2. Estimate the size of each table index
    • Don't know how to estimate this, need advice here
  3. Estimate the size of the transaction log
    • Don't know how to estimate this, need advice here
  4. Estimate the size of the backups (full and incremental)
    • Don't know how to estimate this, need advice here
  5. Sum all the estimates for the actual minimum size

  6. Apply a factor of 1.5x (the 50% growth) to the sum of the estimates 1, 2 and 4 for the minimum size after 1 year

  7. Apply an overall factor of 1.2 ~ 1.4 (20% to 40% more) to estimates 5 and 6 for a good safety margin
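To make the arithmetic of steps 5–7 concrete, here is a small sketch of the whole calculation; every input number below is an illustrative placeholder, not a real measurement:

```python
# Rough storage estimate following steps 1-7 above.
# All input figures are placeholders to show the arithmetic.

table_bytes = 120 * 1_000_000             # step 1: est. row size * est. row count
index_bytes = 40 * 1_000_000              # step 2: est. index bytes per row * row count
wal_bytes = 2 * 1024**3                   # step 3: transaction log allowance
backup_bytes = table_bytes + index_bytes  # step 4: one full backup, uncompressed

minimum_now = table_bytes + index_bytes + wal_bytes + backup_bytes  # step 5

growth = 1.5  # step 6: 50% yearly growth applied to data, indexes, backups
after_one_year = (table_bytes + index_bytes + backup_bytes) * growth + wal_bytes

safety = 1.3  # step 7: somewhere in the 1.2-1.4 safety range
provisioned = after_one_year * safety

print(f"provision at least {provisioned / 1024**3:.1f} GiB")
```

This only mirrors the rules as stated; the real uncertainty lives in the per-row and per-index inputs, which the answer below addresses.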

I know the rules got pretty extensive. Let me know if examples are necessary for a better understanding.

Best Answer

For point 1), you need to read the Storage Page Layout chapter of the documentation and in particular consider the HeapTupleHeaderData Layout table for the metadata at the row level.
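As a sketch of what that chapter implies for per-row overhead (assuming the common 8 kB page and 8-byte MAXALIGN; this is an approximation, since real padding also depends on each column's own alignment requirement):

```python
# Approximate on-disk footprint of one heap row, per the
# Storage Page Layout chapter. Simplified: per-column alignment
# padding inside the data area is ignored.

def row_footprint(data_bytes: int, align: int = 8) -> int:
    header = 23                              # HeapTupleHeaderData
    padded_header = -(-header // align) * align   # padded to 24 with 8-byte MAXALIGN
    padded_data = -(-data_bytes // align) * align # row data, aligned
    line_pointer = 4                         # ItemIdData entry in the page header
    return padded_header + padded_data + line_pointer

# e.g. a row whose column sizes sum to 50 bytes of data:
print(row_footprint(50))  # 24 + 56 + 4 = 84
```

So even a "50-byte" row costs roughly 84 bytes on disk before considering page-level overhead and free space.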

The 4-byte per-row OID is obsolete for user tables: PostgreSQL has not added it by default since 8.1. It is controlled by the default_with_oids configuration parameter or the WITH (OIDS) clause of CREATE TABLE (and support for table OIDs was removed entirely in PostgreSQL 12).

When a significant part of the data is live (frequently updated), estimating disk size is harder, because an UPDATE is equivalent to inserting the new row version and deleting the old one, with the dead space hopefully reused later whenever possible. The fillfactor storage option for tables also comes into play here (see the storage parameters section of CREATE TABLE).

And there is bloat at the index level, too. For write-heavy tables, it's not uncommon for an index to be several times bigger than its optimal size.
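For a pre-bloat ballpark of a B-tree index, a leaf-pages-only estimate can be sketched as follows; the overhead constants are simplifications of the actual btree page layout, so treat the result as order-of-magnitude only:

```python
# Very rough b-tree index size estimate: leaf pages only, ignoring
# internal pages (usually well under 1% extra) and assuming the
# default index fillfactor of 90. Overhead constants are approximate.

PAGE = 8192
PAGE_OVERHEAD = 24 + 16   # page header + btree special space (approx.)
ITEM_POINTER = 4          # line pointer per entry
TUPLE_HEADER = 8          # IndexTupleData header (approx.)

def btree_size_bytes(rows: int, key_bytes: int, fillfactor: float = 0.90) -> int:
    entry = ITEM_POINTER + TUPLE_HEADER + -(-key_bytes // 8) * 8  # key, aligned
    usable = (PAGE - PAGE_OVERHEAD) * fillfactor
    entries_per_page = int(usable // entry)
    pages = -(-rows // entries_per_page)   # ceiling division
    return pages * PAGE

# 10 million rows indexed on a bigint (8-byte key):
print(btree_size_bytes(10_000_000, 8) / 1024**2, "MiB (pre-bloat)")
```

On a write-heavy table, the answer above suggests the real figure can end up several times this.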

You may need to learn a bit about VACUUM to figure out how much you may be affected by table and index bloat in your specific usage.

The size of the transaction logs depends entirely on the volume of writes to the database.
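If you can estimate the write volume, a back-of-the-envelope WAL sizing might look like the following; the per-row WAL figure is an assumption you would want to replace with a measured value, since real WAL records vary widely (full-page images after checkpoints, per-index records, and so on):

```python
# Back-of-the-envelope WAL volume estimate. The bytes-per-row figure
# is an assumed placeholder, not a PostgreSQL constant: measure it on
# a test workload if possible.

rows_written_per_day = 5_000_000
wal_bytes_per_row = 200   # assumption: heap record + index records + overhead

wal_per_day = rows_written_per_day * wal_bytes_per_row
print(f"{wal_per_day / 1024**3:.1f} GiB of WAL per day")
```

The disk held by the live WAL directory itself is bounded by checkpoint and retention settings, but if you archive WAL for incremental backups, the archive grows at roughly this daily rate.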