I have a Java/JDBC program that takes a sample file and imports the data into the database, shredding it across multiple relations. The program does this multiple times for several different representations, one of which uses Large Objects. I can share more details of these representations but they are quite long and aren't relevant to this question since I'm looking for something generic.
I would like to compare the sizes of these different representations by examining the size of the database after each import. The database is on a PostgreSQL 9.4 local Windows server instance, with no other users and default configuration. Its only purpose is to conduct this test.
My initial plan was as follows:
for each representation {
call VACUUM ANALYZE
record old DB size with SELECT pg_tablespace_size('pg_default');
import data into database
call VACUUM ANALYZE
record new DB size with SELECT pg_tablespace_size('pg_default');
store storage cost as new DB size - old DB size
}
Obviously there are limitations to this approach, but my expectation is that for large files (~100MB) the reported storage costs should be reasonable approximations. Note that I use pg_tablespace_size
in order to include the contribution of data outside of the main schema, such as large objects (in pg_catalog.pg_largeobject
and pg_catalog.pg_largeobject_metadata
).
I'm wondering whether this is a correct approach, and whether there is a better approach. I'm unsure whether VACUUM ANALYZE
properly updates the stats used by pg_tablespace_size
, even though it is called in the same session. It would also be better if I could avoid calling VACUUM ANALYZE
, since this requires connecting as the superuser in order to run on the pg_catalog
relations.
Any thoughts?
Best Answer
Just call
pg_database_size(dbname)
to know the size of the database.VACUUM
(without theFULL
clause) does not free any space, it only marks it as reusable, and thus will not change the database's size (except in a rare boundary case, see Routine Vacuuming).ANALYZE
does statistical sampling and would be useful if you needed the row counts, but for the global db size, it's not necessary.