PostgreSQL data files are much larger than the data itself

postgresql, storage

We have a system that does some data archiving to a PostgreSQL DB. We found out that the PC's storage was full due to the DB archiving. The problem is that I checked the data files residing in /var/lib/pgsql/data/base/ and they were about 70 GB in total, while when I dumped all the databases using pg_dump the output files did not exceed 24 GB. Am I missing or misunderstanding something? Where does this large difference in size go?

Edit: I did the pg_dump with both schema and data, using the -c option so that the dump includes the DROP and CREATE statements.

Edit 2: I investigated the DB schema and found that the table containing almost 23.9 GB of the 24 GB of data (about 332.4 million rows) has an index on it. There is another index on another table, but that table is empty.

Edit 3: The program stores the values of about 1,500 variables periodically; each variable is recorded at an interval somewhere between 0.1 second and 1 minute or a bit more, so there is a lot of write activity on the DB.

Edit 4: I executed the second query here to find the size of each relation in the schema (roughly the kind of query sketched after the list below), and I found the following:

  • 28 GB for the main data table.
  • about 42 GB for just 3 indexes! 24 GB, 9 GB, and 9 GB.
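
For reference, the kind of query I ran looks roughly like this (not the exact one; it just sums per-relation sizes from the system catalogs):

    -- Approximate size of every table in the current database, largest first.
    -- pg_total_relation_size includes indexes and TOAST; pg_relation_size is the table alone.
    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
           pg_size_pretty(pg_relation_size(c.oid))       AS table_only_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 20;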

My goal is to do a backup and restore regularly (every few months). Should I care about these DB indexes when doing the backup and restore, or just focus on my data tables?

Best Answer

The sizes can differ for several reasons:

  • indexes take some disk space,
  • there can be several copies of the same record on disk,
  • slack space in pages.

Indexes take up disk space in order to facilitate faster lookups. The more indexes you have, the more disk space your DB will take up. GIN indexes are usually smaller, but aren't useful for range queries.
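
To see where that space goes index by index, a query along these lines can help (a sketch; filter further by schema or table as needed):

    -- Size of each user index and the table it belongs to, largest first.
    SELECT t.relname AS table_name,
           i.relname AS index_name,
           pg_size_pretty(pg_relation_size(i.oid)) AS index_size
    FROM pg_index x
    JOIN pg_class i ON i.oid = x.indexrelid
    JOIN pg_class t ON t.oid = x.indrelid
    JOIN pg_namespace n ON n.oid = t.relnamespace
    WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY pg_relation_size(i.oid) DESC;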

PostgreSQL supports concurrent access (MVCC), which is implemented so that updates and deletes only change the visibility of records rather than actually deleting or overwriting the data, since an old row version may still be in use by another transaction. An UPDATE adds a new (updated) copy of the row. Both mean that old data is still kept on disk. To free it up, PostgreSQL periodically performs vacuuming, which actually removes the dead records (once no transaction is using them anymore).
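
You can get a rough idea of how many dead (no longer visible) row versions each table is carrying, and when it was last vacuumed, from the statistics views, for example:

    -- Live vs. dead row versions and the last (auto)vacuum time per table.
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;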

PostgreSQL has a default block size of 8 KB. If your records are big, say 5 KB, only one record fits per block, leaving fairly large (~3 KB) slack.
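
To check the block size and get a rough bytes-per-row figure for the big table, something like this works (the table name is a placeholder, and the estimate includes page and tuple overhead):

    SHOW block_size;   -- usually 8192

    -- Very rough average on-disk bytes per row, based on planner statistics.
    SELECT relname,
           pg_relation_size(oid) / NULLIF(reltuples, 0)::bigint AS approx_bytes_per_row
    FROM pg_class
    WHERE relname = 'archive_data';   -- placeholder table name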

Some possible solutions are:

  • Be sure to know how to check for disk usage in PostgreSQL (the size queries sketched above are a good starting point),
  • reevaluate whether you really need all the indexes,
  • think about what your data access pattern is - if it's an append-only table, there should be very few dead (invisible) records. If, on the other hand, your data is changed for a while and only archived after a period, then it's quite possible that multiple versions of a record are on disk. If this is the case, partitioning your table could help, since you can vacuum the partitions separately.
  • you can vacuum the table manually (see the sketch after this list). Don't use VACUUM FULL, since it writes a whole new copy of the table and therefore needs a corresponding amount of free disk space.
  • you can use one of the online repacking tools (online as in you don't need to shut down the DB), such as pg_repack.
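
A minimal sketch of the manual vacuum mentioned above (the table name is a placeholder). Plain VACUUM marks dead row versions as reusable without the exclusive lock and extra disk space that VACUUM FULL needs; it generally won't shrink the existing files, but it stops further growth caused by dead rows:

    -- Reclaims dead row versions for reuse and refreshes planner statistics.
    VACUUM (VERBOSE, ANALYZE) archive_data;   -- placeholder table name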

Edit:

pg_dump

pg_dump is fine if you tell it to dump the data too, as you've said you did. If you use the custom format (the -Fc flag), pg_restore can do some extra things with it, such as restoring only specified tables; also see pg_dump's man page. The custom format gzips the dump by default, which may slow down your dumping, so you may want to disable that, and if you still want the data compressed, there's a parallel gzip (pigz).
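
For example, a custom-format dump with the built-in compression turned off and pigz compressing in parallel instead could look roughly like this (database name, path and the number of pigz processes are placeholders):

    # Custom-format dump, no built-in compression (-Z 0), compressed in parallel by pigz.
    pg_dump -Fc -Z 0 archive_db | pigz -p 4 > /backup/archive_db.dump.gz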

Backups

When you back up, you back up the data in the tables; indexes are recreated from that data. If you can afford heavy I/O on the database while you are doing backups and restores, then pg_dump and pg_restore may suit your needs. When dumping, pg_dump dumps only the table data (plus the index definitions), and when restoring, the indexes are rebuilt automatically as part of the restore. So to make it clear: backups don't need to care about indexes.
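
Restoring such a dump could then look roughly like this (names and paths are again placeholders); the index rebuild happens at the end of the restore and is typically the slowest part:

    # Decompress first: pg_restore's parallel mode (-j) needs a regular file, not a pipe.
    pigz -d /backup/archive_db.dump.gz
    # -c drops existing objects before recreating them, -j 4 uses four parallel jobs.
    pg_restore -d archive_db -c -j 4 /backup/archive_db.dump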

If heavy I/O and degraded performance are not acceptable, then you might want to have a secondary replication server, which holds a copy of the data but doesn't serve your ordinary queries, so you can use it to make the dumps.

If you need point-in-time restore capability, you could set up WAL (write-ahead log) archiving, which then enables you to restore to a specific point in time or transaction, but this is quite advanced. There are tools to help with it, such as Barman.
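
A minimal postgresql.conf sketch for WAL archiving, assuming a local directory as the archive destination (the path is a placeholder, and on 9.x versions wal_level has to be 'archive' or 'hot_standby' rather than 'replica'):

    # postgresql.conf -- these settings require a server restart
    wal_level = replica                        # 'archive' or 'hot_standby' on 9.x
    archive_mode = on
    archive_command = 'cp %p /backup/wal/%f'   # placeholder archive destination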

Big indexes

As you've already discovered, indexes can take up a lot of space. If you add up your data and indexes, you get your DB size: 28 GB + 42 GB = 70 GB.

Having big indexes means there's a lot of extra disk space used. Index data is also cached in memory, so having big indexes means you may effectively have two copies of the indexed data in RAM (once in the index pages, once in the table pages), which leaves less RAM for caching table data from disk and results in more cache misses. There are some options to evaluate to make indexes smaller:

  • partial indexes: if the application only queries rows where one field is set to a specific value, you can limit which rows are indexed with a WHERE clause on CREATE INDEX.
  • make use of index-only scans: if all the queried columns are included in an index, PostgreSQL can answer the query by reading the index data alone. This works since PostgreSQL 9.2; see more about index-only scans on the wiki.
  • use another type of index: if your application only does equality lookups (WHERE a = 4), GIN indexes can be a lot smaller; see the btree_gin extension. PostgreSQL 9.5 will also bring BRIN indexes, which are really useful for fields with a monotonically increasing or decreasing pattern, such as timestamps, and they are really, really small, so you can free up both disk space for more storage and RAM for caching more data. (Sketches of these options follow this list.)
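
A few sketches of those options; the table and column names (archive_data, variable_id, recorded_at, value, quality) are placeholders for whatever the real schema uses:

    -- Partial index: only index the rows the application actually filters on.
    CREATE INDEX idx_bad_quality ON archive_data (recorded_at)
        WHERE quality <> 0;

    -- Covering b-tree for index-only scans (9.2+): include every column the query reads.
    CREATE INDEX idx_var_time_value ON archive_data (variable_id, recorded_at, value);

    -- BRIN (9.5+): tiny index for monotonically increasing columns such as timestamps.
    CREATE INDEX idx_recorded_at_brin ON archive_data USING brin (recorded_at);

    -- GIN for plain equality lookups on a scalar column (needs the btree_gin extension).
    CREATE EXTENSION IF NOT EXISTS btree_gin;
    CREATE INDEX idx_variable_gin ON archive_data USING gin (variable_id);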

But... a lot of this depends on how the data is accessed, which is often known only to the developer of the application.