Accurate PostgreSQL Database Size for Comparing Representations

I have a Java/JDBC program that takes a sample file and imports the data into the database, shredding it across multiple relations. The program does this multiple times for several different representations, one of which uses Large Objects. I can share more details of these representations but they are quite long and aren't relevant to this question since I'm looking for something generic.

I would like to compare the sizes of these different representations by examining the size of the database after each import. The database is on a PostgreSQL 9.4 local Windows server instance, with no other users and default configuration. Its only purpose is to conduct this test.

My initial plan was as follows:

for each representation {
  call VACUUM ANALYZE
  record old DB size with SELECT pg_tablespace_size('pg_default');
  import data into database
  call VACUUM ANALYZE
  record new DB size with SELECT pg_tablespace_size('pg_default');
  store storage cost as new DB size - old DB size
}

Obviously there are limitations to this approach, but my expectation is that for large files (~100MB) the reported storage costs should be reasonable approximations. Note that I use pg_tablespace_size in order to include the contribution of data outside of the main schema, such as large objects (in pg_catalog.pg_largeobject and pg_catalog.pg_largeobject_metadata).

I'm wondering whether this is a correct approach, and whether there is a better approach. I'm unsure whether VACUUM ANALYZE properly updates the stats used by pg_tablespace_size, even though it is called in the same session. It would also be better if I could avoid calling VACUUM ANALYZE, since this requires connecting as the superuser in order to run on the pg_catalog relations.

Any thoughts?

Best Answer

Just call pg_database_size(dbname) to know the size of the database.

VACUUM (without the FULL clause) does not free any space, it only marks it as reusable, and thus will not change the database's size (except in a rare boundary case, see Routine Vacuuming).

ANALYZE does statistical sampling and would be useful if you needed the row counts, but for the global db size, it's not necessary.

Best Answer

Related Solutions

PostgreSQL Initial Database Size – SQL Server and PostgreSQL Fragmentation

Postgresql – Slow access to table in postgresql despite vacuum

Related Question