PostgreSQL – pg_database_size() is small, but pg_dump is very big

postgresql

I've run into a strange issue.
When I use the psql command "\l+ db_name" or the following SQL,

cattle_dev=# \l+ cattle_dev                                        
                                                    List of databases
    Name    |   Owner    | Encoding |   Collate   |    Ctype    | Access privileges |  Size   | Tablespace | Description 
------------+------------+----------+-------------+-------------+-------------------+---------+------------+-------------
 cattle_dev | cattle_dev | UTF8     | zh_CN.UTF-8 | zh_CN.UTF-8 |                   | 1738 MB | pg_default | 
(1 row)

cattle_dev=# select pg_size_pretty(pg_database_size('cattle_dev'));
 pg_size_pretty 
----------------
 1738 MB
(1 row)

cattle_dev=# 

it reports that the database is small.
But if I use pg_dump to back up this database, the result is very large: almost 18 GB.

[enterprisedb@ppasdev 20170605012443]$ ll -h cattle_dev_20170605012443.dmp 
-rw-rw-r--. 1 enterprisedb enterprisedb 18G Jun  5 01:34 cattle_dev_20170605012443.dmp
[enterprisedb@ppasdev 20170605012443]$ 

The question is: why is the pg_dump output so large when pg_database_size() reports such a small database?

Any help would be appreciated. Thanks.

Best Answer

One factor is the binary size on disk. PostgreSQL compresses on disk all TOASTable fields over 2 kB (such as text). Also think about binary values: they usually take less space on disk than the same value cast to text anyway.

SELECT b::text, pg_column_size(b) AS on_disk, length(b::text) AS text
FROM ( VALUES (now()) ) AS t(b)
UNION ALL 
  SELECT b::text, pg_column_size(b), length(b::text)
  FROM ( VALUES (49839489::int) ) AS t(b)
UNION ALL
  SELECT b::text, pg_column_size(b), length(b::text)
  FROM ( VALUES ('192.168.43.58'::inet) ) AS t(b);
               b               | on_disk | text 
-------------------------------+---------+------
 2017-06-05 20:56:45.978472-05 |       8 |   29
 49839489                      |       4 |    8
 192.168.43.58/32              |      10 |   16
(3 rows)
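
The other side of it is the TOAST compression itself. Here is a minimal sketch you can run against any database (it only creates a temporary table): a highly repetitive text value is stored compressed on disk, but a plain-text dump has to write it back out at full length.

CREATE TEMP TABLE toast_demo (payload text);

-- 80,000 characters of very repetitive text; values over ~2 kB are TOAST candidates
INSERT INTO toast_demo SELECT repeat('abcdefgh', 10000);

-- stored_bytes is the (compressed) on-disk size, text_chars is what a plain dump writes out
SELECT pg_column_size(payload) AS stored_bytes,
       length(payload)         AS text_chars
FROM toast_demo;

On a value this compressible, stored_bytes should come out at a small fraction of text_chars; real columns full of repetitive text, JSON, or logs behave the same way, which is the kind of gap you're seeing between pg_database_size() and the dump.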

In addition, you're going to have overhead simply from the SQL statements, the schema definitions, and the text data format itself (COPY lines, INSERTs, or whatever).
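
If you want to see how that splits out for your own database, one rough check (file names here are just examples) is to dump the schema and the data separately and compare sizes:

pg_dump --schema-only cattle_dev > cattle_dev_schema.sql
pg_dump --data-only cattle_dev > cattle_dev_data.sql
ls -lh cattle_dev_schema.sql cattle_dev_data.sql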

You may want to look into the -Z option (or xz if you need the best compression).

-Z 0..9
--compress=0..9
  Specify the compression level to use. Zero means no compression. For the custom archive format, this specifies compression of individual table-data
  segments, and the default is to compress at a moderate level. For plain text output, setting a nonzero compression level causes the entire output file
  to be compressed, as though it had been fed through gzip; but the default is not to compress. The tar archive format currently does not support
  compression at all.
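
As a concrete sketch (file names are just examples), either let pg_dump compress with the custom format, or pipe a plain dump through an external compressor:

# custom-format archive compressed at level 9; restore it with pg_restore
pg_dump -Fc -Z 9 -f cattle_dev.dump cattle_dev

# or a plain SQL dump piped through xz for a better ratio (slower)
pg_dump cattle_dev | xz -9 > cattle_dev.sql.xz

With data this compressible, either of these should come out far smaller than the 18 GB plain dump.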