PostgreSQL data files are much larger than the data itself

postgresql, storage

We have a system that does some data archiving to a PostgreSQL DB. We found out that the PC's storage was full due to the DB archiving. The problem is that I checked the data files residing in /var/lib/pgsql/data/base/ and they were about 70 GB in total, while when I dumped all the databases using pg_dump the output files did not exceed 24 GB. Am I missing or misunderstanding something? Where does this large difference in size go?

Edit: I did the pg_dump with both schema and data, using the -c option so that the dump includes the DROP and CREATE statements.

Edit 2: I investigated the DB schema and found that the table containing almost 23.9 GB of the 24 GB of data (about 332.4 million rows) has an index on it. There is another index on another table, but that table is empty.

Edit 3: The program stores the values of about 1,500 variables periodically; each variable is recorded at an interval somewhere between 0.1 second and 1 minute or a bit more, so there is a lot of write activity on the DB.

Edit 4: I executed the second query here to find the size of each relation in the schema (roughly the kind of query sketched after the list below), and I found the following:

  • 28 GB for the main data table.
  • about 42 GB for just 3 indexes! 24 GB, 9 GB, and 9 GB.
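
For reference, the kind of query I ran looks roughly like this (not the exact one; it just sums per-relation sizes from the system catalogs):

    -- Approximate size of every table in the current database, largest first.
    -- pg_total_relation_size includes indexes and TOAST; pg_relation_size is the table alone.
    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
           pg_size_pretty(pg_relation_size(c.oid))       AS table_only_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 20;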

My goal is to do a backup and restore regularly (every few months). Should I care about these DB indexes when doing the backup and restore, or just focus on my data tables?

Best Answer

The sizes can differ for several reasons:

  • indexes take some disk space,
  • there can be several copies of the same record on disk,
  • slack space in pages.

Indexes take up disk space in order to facilitate faster lookups. The more indexes you have, the more disk space your DB will take up. GIN indexes are usually smaller, but aren't useful for range queries.
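
To see where that space goes index by index, a query along these lines can help (a sketch; filter further by schema or table as needed):

    -- Size of each user index and the table it belongs to, largest first.
    SELECT t.relname AS table_name,
           i.relname AS index_name,
           pg_size_pretty(pg_relation_size(i.oid)) AS index_size
    FROM pg_index x
    JOIN pg_class i ON i.oid = x.indexrelid
    JOIN pg_class t ON t.oid = x.indrelid
    JOIN pg_namespace n ON n.oid = t.relnamespace
    WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY pg_relation_size(i.oid) DESC;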

PostgreSQL supports concurrent access (MVCC), which is implemented so that updates and deletes only change the visibility of records rather than actually deleting or overwriting the data, since an old row version may still be in use by another transaction. An UPDATE adds a new (updated) copy of the row. Both mean that old data is still kept on disk. To free it up, PostgreSQL periodically performs vacuuming, which actually removes the dead records (once no transaction is using them anymore).
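
You can get a rough idea of how many dead (no longer visible) row versions each table is carrying, and when it was last vacuumed, from the statistics views, for example:

    -- Live vs. dead row versions and the last (auto)vacuum time per table.
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;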

PostgreSQL has a default block size of 8 KB. If your records are big, say 5 KB, only one record fits per block, leaving fairly large (~3 KB) slack.
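
To check the block size and get a rough bytes-per-row figure for the big table, something like this works (the table name is a placeholder, and the estimate includes page and tuple overhead):

    SHOW block_size;   -- usually 8192

    -- Very rough average on-disk bytes per row, based on planner statistics.
    SELECT relname,
           pg_relation_size(oid) / NULLIF(reltuples, 0)::bigint AS approx_bytes_per_row
    FROM pg_class
    WHERE relname = 'archive_data';   -- placeholder table name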

Some possible solutions are:

  • Be sure to know how to check for disk usage in PostgreSQL (the size queries sketched above are a good starting point),
  • reevaluate whether you really need all the indexes,
  • think about what your data access pattern is - if it's an append-only table, there should be very few dead (invisible) records. If, on the other hand, your data is changed for a while and only archived after a period, then it's quite possible that multiple versions of a record are on disk. If this is the case, partitioning your table could help, since you can vacuum the partitions separately.
  • you can vacuum the table manually (see the sketch after this list). Don't use VACUUM FULL, since it writes a whole new copy of the table and therefore needs a corresponding amount of free disk space.
  • you can use one of the online repacking tools (online as in you don't need to shut down the DB), such as pg_repack.
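
A minimal sketch of the manual vacuum mentioned above (the table name is a placeholder). Plain VACUUM marks dead row versions as reusable without the exclusive lock and extra disk space that VACUUM FULL needs; it generally won't shrink the existing files, but it stops further growth caused by dead rows:

    -- Reclaims dead row versions for reuse and refreshes planner statistics.
    VACUUM (VERBOSE, ANALYZE) archive_data;   -- placeholder table name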

Edit:

pg_dump

pg_dump is fine if you tell it to dump the data too, as you've said you did. If you use the custom format (the -Fc flag), pg_restore can do some extra things with it, such as restoring only specified tables; also see pg_dump's man page. The custom format gzips the dump by default, which may slow down your dumping, so you may want to disable that, and if you still want the data compressed, there's a parallel gzip (pigz).
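
For example, a custom-format dump with the built-in compression turned off and pigz compressing in parallel instead could look roughly like this (database name, path and the number of pigz processes are placeholders):

    # Custom-format dump, no built-in compression (-Z 0), compressed in parallel by pigz.
    pg_dump -Fc -Z 0 archive_db | pigz -p 4 > /backup/archive_db.dump.gz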

Backups

When you back up, you back up the data in the tables; indexes are recreated from that data. If you can afford heavy I/O on the database while you are doing backups and restores, then pg_dump and pg_restore may suit your needs. When dumping, pg_dump dumps only the table data (plus the index definitions), and when restoring, the indexes are rebuilt automatically as part of the restore. So to make it clear: backups don't need to care about indexes.
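
Restoring such a dump could then look roughly like this (names and paths are again placeholders); the index rebuild happens at the end of the restore and is typically the slowest part:

    # Decompress first: pg_restore's parallel mode (-j) needs a regular file, not a pipe.
    pigz -d /backup/archive_db.dump.gz
    # -c drops existing objects before recreating them, -j 4 uses four parallel jobs.
    pg_restore -d archive_db -c -j 4 /backup/archive_db.dump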

If heavy I/O and degraded performance are not acceptable, then you might want to have a secondary replication server, which holds a copy of the data but doesn't serve your ordinary queries, so you can use it to make the dumps.

If you need point-in-time restore capability, you could set up WAL (write-ahead log) archiving, which then enables you to restore to a specific point in time or transaction, but this is quite advanced. There are tools to help with it, such as Barman.
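
A minimal postgresql.conf sketch for WAL archiving, assuming a local directory as the archive destination (the path is a placeholder, and on 9.x versions wal_level has to be 'archive' or 'hot_standby' rather than 'replica'):

    # postgresql.conf -- these settings require a server restart
    wal_level = replica                        # 'archive' or 'hot_standby' on 9.x
    archive_mode = on
    archive_command = 'cp %p /backup/wal/%f'   # placeholder archive destination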

Big indexes

As you've already discovered, indexes can take up a lot of space. If you add up your data and indexes, you get your DB size: 28 GB + 42 GB = 70 GB.

Having big indexes means there's a lot of extra disk space used. Index data is also cached in memory, so having big indexes means you may effectively have two copies of the indexed data in RAM (once in the index pages, once in the table pages), which leaves less RAM for caching table data from disk and results in more cache misses. There are some options to evaluate to make indexes smaller:

  • partial indexes: if the application only queries rows where one field is set to a specific value, you can limit which rows are indexed with a WHERE clause on CREATE INDEX.
  • make use of index-only scans: if all the queried columns are included in an index, PostgreSQL can answer the query by reading the index data alone. This works since PostgreSQL 9.2; see more about index-only scans on the wiki.
  • use another type of index: if your application only does equality lookups (WHERE a = 4), GIN indexes can be a lot smaller; see the btree_gin extension. PostgreSQL 9.5 will also bring BRIN indexes, which are really useful for fields with a monotonically increasing or decreasing pattern, such as timestamps, and they are really, really small, so you can free up both disk space for more storage and RAM for caching more data. (Sketches of these options follow this list.)
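
A few sketches of those options; the table and column names (archive_data, variable_id, recorded_at, value, quality) are placeholders for whatever the real schema uses:

    -- Partial index: only index the rows the application actually filters on.
    CREATE INDEX idx_bad_quality ON archive_data (recorded_at)
        WHERE quality <> 0;

    -- Covering b-tree for index-only scans (9.2+): include every column the query reads.
    CREATE INDEX idx_var_time_value ON archive_data (variable_id, recorded_at, value);

    -- BRIN (9.5+): tiny index for monotonically increasing columns such as timestamps.
    CREATE INDEX idx_recorded_at_brin ON archive_data USING brin (recorded_at);

    -- GIN for plain equality lookups on a scalar column (needs the btree_gin extension).
    CREATE EXTENSION IF NOT EXISTS btree_gin;
    CREATE INDEX idx_variable_gin ON archive_data USING gin (variable_id);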

But... a lot of this depends on how the data is accessed, which is often known only to the developer of the application.