PostgreSQL: Separate tables vs single table to preserve disk space

database-size, datatypes, disk-space, postgresql, schema

I have 2 tables with the schemas below and an equal number of rows. When I run SELECT relname, relpages FROM pg_class ORDER BY relpages DESC, each shows up as 23GB, even though the data types (total number of bytes per row) of the two schemas should result in different sizes. What is the best way to figure out how much space I would save by combining the tables? Also, is there any way to determine how much space each row is actually taking up?

                                 Table "public.table1"
Column        |           Type           |                      Modifiers                      
--------------+--------------------------+-----------------------------------------------------
 field1       | smallint                 | 
 field2       | smallint                 | 
 field3       | integer                  | 
 field4       | smallint                 | 
 timestamp    | timestamp with time zone | 
 user_id      | integer                  | 
 status       | boolean                  | 
 id           | integer                  | not null default 

                                  Table "public.table2"
 Column          |           Type           |                     Modifiers                      
 ----------------+--------------------------+----------------------------------------------------
 user_id         | integer                  | 
 begin_timestamp | timestamp with time zone | 
 end_timestamp   | timestamp with time zone | 
 field           | smallint                 | not null
 id              | integer                  | not null default 

Best Answer

A reasonable answer to this must, IMHO, ask (or at least imply) the question: why were the tables split to begin with? (More importantly: does that reason still apply?)

The general case: I do not know the specifics of how the postgres storage engine works, but the disk space saved can be no larger than the size of the duplicate columns removed, plus any indexes on those duplicate columns, plus a bit of per-row housekeeping data.
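To put actual numbers on what each table costs today, PostgreSQL ships sizing functions you can run before and after any restructuring (table names taken from the question; pg_total_relation_size includes indexes and TOAST data):

```sql
-- Heap, index, and total on-disk size for the two tables from the question:
SELECT relname,
       pg_size_pretty(pg_relation_size(oid))       AS heap_size,
       pg_size_pretty(pg_indexes_size(oid))        AS index_size,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('table1', 'table2');
```

Comparing heap_size against total_size also tells you how much of that 23GB is really index overhead rather than row data.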

The upper bound on the savings is quite small either way, so it's probably not worth the trouble. It may even be that the tables were partitioned for performance reasons: some columns are accessed (read, or even updated) more often than others, so that table could be kept narrow (fewer columns) to make those accesses faster, at some penalty for accessing columns from both tables at the same time.

The specific case: there appear to be several redundant fields between the two. user_id, for example, would not need to be stored twice after a merge. The single timestamp and the begin/end timestamps could perhaps be reduced to two columns, or even one. So there could be some slight saving from that, depending on the application's requirements. Again, try to figure out why the split was done to begin with.
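As a sketch only, assuming the two tables really are 1:1 on id (something to verify before merging anything), a combined schema might look like the following. One Postgres-specific bonus: listing 8-byte columns before 4-byte and 2-byte ones reduces alignment padding inside each row.

```sql
-- Hypothetical merged table; assumes a 1:1 relationship between the two
-- originals, which must be verified first. Columns are ordered from widest
-- to narrowest alignment to minimise padding bytes.
CREATE TABLE merged (
    "timestamp"     timestamp with time zone,
    begin_timestamp timestamp with time zone,
    end_timestamp   timestamp with time zone,
    id              integer NOT NULL,
    user_id         integer,
    field3          integer,
    field1          smallint,
    field2          smallint,
    field4          smallint,
    field           smallint NOT NULL,
    status          boolean
);
```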

Overall, though, I agree with Bryan Agee: disk space should probably not be your concern on anything big enough to run postgres in the first place. In particular, if you get paid for the time you spend "fixing" this, that cost is likely to exceed the cost of a bigger disk by itself.

Per-row space: I'm not well versed in postgres particulars, so someone in the know should correct anything wrong here, such as whether there is a way to ask postgres how much space a specific row actually takes up; I don't know of one. What follows is basically the theory behind database storage as (I believe) it is commonly done today.
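For what it's worth, postgres does expose a function for exactly this: pg_column_size() reports the on-disk size of any value, and when passed a whole-row variable it reports the stored size of that row:

```sql
-- On-disk size of each stored row in table1 (data plus per-row overhead,
-- with any TOAST compression already applied):
SELECT id, pg_column_size(t.*) AS row_bytes
FROM table1 AS t
ORDER BY row_bytes DESC
LIMIT 10;
```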

Each field has a bit (in a byte of its own, or in a null bitmap shared by the whole row) that says whether the value is NULL. If it is NULL, nothing more is stored. Then there is a length byte, unless the length is implicit from a fixed-width data type. Then comes the data itself.
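In PostgreSQL specifically you can watch the null bitmap at work by calling pg_column_size() on constructed row values; a NULL field adds no data bytes beyond the row header:

```sql
SELECT pg_column_size(ROW())              AS header_only,  -- row header alone
       pg_column_size(ROW(1::integer))    AS one_int,      -- header plus the int
       pg_column_size(ROW(NULL::integer)) AS one_null;     -- NULL stores no data,
                                                           -- so this matches header_only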

Thus, a row of one int (even a 64-bit one) and three NULL values may take as little as 3 bytes: for values below 127, store size = 1 and then the byte itself. Add indexes and various other housekeeping metadata on top; again, I don't know just how far postgres goes in this regard. All of these factors taken together tend to make "how much does this row occupy?" a question with a not-very-useful answer.

AFAIK postgres also operates with "pages": fixed-size blocks (8 kB by default) that a single row may not be stored across the boundaries of. So larger rows are more likely to "just not fit" in the space remaining in a page and end up placed in another/new page, leaving some slack space behind.
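This is where the question's pg_class query comes back in: relpages counts exactly those pages (it is an estimate, refreshed by VACUUM and ANALYZE), so multiplying it by the block size approximates the heap size:

```sql
SHOW block_size;  -- 8192 bytes unless the cluster was built with a different size

-- relpages * block_size approximates the on-disk heap size:
SELECT relname, relpages,
       pg_size_pretty(relpages::bigint * current_setting('block_size')::bigint)
           AS approx_heap_size
FROM pg_class
WHERE relname IN ('table1', 'table2');
```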