Postgresql – Proper storage size estimation technique for PostgreSQL

Tags: database-size, postgresql, scalability

We are preparing a PostgreSQL database for production use and need to estimate its storage size. We are a team of developers with little database administration expertise, so we are doing research, reading the manuals, and using our general IT knowledge to get there.

We have actual data to migrate to this database and some rough growth estimates. For the sake of the example, let's say we estimate 50% growth per year.

The point is: what is the proper general technique for making a good size estimate?

We are estimating storage usage with the following rules. Topics where we need advice are marked in bold. Feedback on the whole process is more than welcome:

  1. Estimate the size of each table
    1. Discover the actual size of each row.
      • For fields with a fixed size (like bigint, char, etc) we used the sizes described in the documentation
      • For fields with a dynamic size (like text) we estimated the string length and used the function `SELECT pg_column_size('expected text here'::text)`
      • We added 4 more bytes for the OID that PostgreSQL uses internally
    2. Multiply the size of each row by the number of estimated rows
    3. Do I need to consider any overhead here, like row or table metadata?
  2. Estimate the size of each table index
    • Don't know how to estimate this, need advice here
  3. Estimate the size of the transaction log
    • Don't know how to estimate this, need advice here
  4. Estimate the size of the backups (full and incremental)
    • Don't know how to estimate this, need advice here
  5. Sum all the estimates for the actual minimum size

  6. Apply a factor of 1.5x (the 50% growth) to the sum of the estimates 1, 2 and 4 for the minimum size after 1 year

  7. Apply an overall factor of 1.2 ~ 1.4 (20% to 40% more) to estimates 5 and 6 for a good safety margin
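To make the arithmetic of steps 5–7 concrete, here is a small sketch of the whole calculation; every input number below is an illustrative placeholder, not a real measurement:

```python
# Rough storage estimate following steps 1-7 above.
# All input figures are placeholders to show the arithmetic.

table_bytes = 120 * 1_000_000             # step 1: est. row size * est. row count
index_bytes = 40 * 1_000_000              # step 2: est. index bytes per row * row count
wal_bytes = 2 * 1024**3                   # step 3: transaction log allowance
backup_bytes = table_bytes + index_bytes  # step 4: one full backup, uncompressed

minimum_now = table_bytes + index_bytes + wal_bytes + backup_bytes  # step 5

growth = 1.5  # step 6: 50% yearly growth applied to data, indexes, backups
after_one_year = (table_bytes + index_bytes + backup_bytes) * growth + wal_bytes

safety = 1.3  # step 7: somewhere in the 1.2-1.4 safety range
provisioned = after_one_year * safety

print(f"provision at least {provisioned / 1024**3:.1f} GiB")
```

This only mirrors the rules as stated; the real uncertainty lives in the per-row and per-index inputs, which the answer below addresses.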

I know the rules got pretty extensive. Let me know if examples are necessary for a better understanding.

Best Answer

For point 1), you need to read the Storage Page Layout chapter of the documentation and in particular consider the HeapTupleHeaderData Layout table for the metadata at the row level.
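As a sketch of what that chapter implies for per-row overhead (assuming the common 8 kB page and 8-byte MAXALIGN; this is an approximation, since real padding also depends on each column's own alignment requirement):

```python
# Approximate on-disk footprint of one heap row, per the
# Storage Page Layout chapter. Simplified: per-column alignment
# padding inside the data area is ignored.

def row_footprint(data_bytes: int, align: int = 8) -> int:
    header = 23                              # HeapTupleHeaderData
    padded_header = -(-header // align) * align   # padded to 24 with 8-byte MAXALIGN
    padded_data = -(-data_bytes // align) * align # row data, aligned
    line_pointer = 4                         # ItemIdData entry in the page header
    return padded_header + padded_data + line_pointer

# e.g. a row whose column sizes sum to 50 bytes of data:
print(row_footprint(50))  # 24 + 56 + 4 = 84
```

So even a "50-byte" row costs roughly 84 bytes on disk before considering page-level overhead and free space.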

The 4-byte per-row OID is obsolete for user tables: PostgreSQL has not added it by default since 8.1. It is controlled by the default_with_oids configuration parameter or the WITH (OIDS) clause of CREATE TABLE (and support for table OIDs was removed entirely in PostgreSQL 12).

When a significant part of the data is live (frequently updated), estimating disk size is harder, because an UPDATE is equivalent to inserting the new row version and deleting the old one, with the dead space hopefully reused later whenever possible. The fillfactor storage option for tables also comes into play here (see the storage parameters section of CREATE TABLE).

And there is bloat at the index level, too. For write-heavy tables, it's not uncommon for an index to be several times bigger than its optimal size.
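For a pre-bloat ballpark of a B-tree index, a leaf-pages-only estimate can be sketched as follows; the overhead constants are simplifications of the actual btree page layout, so treat the result as order-of-magnitude only:

```python
# Very rough b-tree index size estimate: leaf pages only, ignoring
# internal pages (usually well under 1% extra) and assuming the
# default index fillfactor of 90. Overhead constants are approximate.

PAGE = 8192
PAGE_OVERHEAD = 24 + 16   # page header + btree special space (approx.)
ITEM_POINTER = 4          # line pointer per entry
TUPLE_HEADER = 8          # IndexTupleData header (approx.)

def btree_size_bytes(rows: int, key_bytes: int, fillfactor: float = 0.90) -> int:
    entry = ITEM_POINTER + TUPLE_HEADER + -(-key_bytes // 8) * 8  # key, aligned
    usable = (PAGE - PAGE_OVERHEAD) * fillfactor
    entries_per_page = int(usable // entry)
    pages = -(-rows // entries_per_page)   # ceiling division
    return pages * PAGE

# 10 million rows indexed on a bigint (8-byte key):
print(btree_size_bytes(10_000_000, 8) / 1024**2, "MiB (pre-bloat)")
```

On a write-heavy table, the answer above suggests the real figure can end up several times this.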

You may need to learn a bit about VACUUM to figure out how much you may be affected by table and index bloat in your specific usage.

The size of the transaction logs depends entirely on the volume of writes to the database.
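If you can estimate the write volume, a back-of-the-envelope WAL sizing might look like the following; the per-row WAL figure is an assumption you would want to replace with a measured value, since real WAL records vary widely (full-page images after checkpoints, per-index records, and so on):

```python
# Back-of-the-envelope WAL volume estimate. The bytes-per-row figure
# is an assumed placeholder, not a PostgreSQL constant: measure it on
# a test workload if possible.

rows_written_per_day = 5_000_000
wal_bytes_per_row = 200   # assumption: heap record + index records + overhead

wal_per_day = rows_written_per_day * wal_bytes_per_row
print(f"{wal_per_day / 1024**3:.1f} GiB of WAL per day")
```

The disk held by the live WAL directory itself is bounded by checkpoint and retention settings, but if you archive WAL for incremental backups, the archive grows at roughly this daily rate.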