PostgreSQL – Why does this tsvector update use so much space?

disk-space, postgresql, update

This is my table structure:

  Column     |           Type           | Collation | Nullable |               Default                | Storage  | Stats target | Description 
-------------+--------------------------+-----------+----------+--------------------------------------+----------+--------------+-------------
 id          | integer                  |           | not null | nextval('tbl_disk_id_seq'::regclass) | plain    |              | 
 inserttime  | timestamp with time zone |           | not null | now()                                | plain    |              | 
 buildid     | text                     |           | not null |                                      | extended |              | 
 app_version | text                     |           | not null |                                      | extended |              | 
 device      | text                     |           | not null |                                      | extended |              | 
 device_api  | text                     |           | not null |                                      | extended |              | 
 userid      | text                     |           | not null |                                      | extended |              | 
 path        | text                     |           | not null |                                      | extended |              | 
 size        | numeric                  |           | not null |                                      | main     |              | 
 type        | integer                  |           | not null |                                      | plain    |              | 
 root        | integer                  |           | not null |                                      | plain    |              | 
 status      | integer                  |           | not null | 0                                    | plain    |              | 

I used this query to check the size of the table:

SELECT  relname AS "relation",
  pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'                                                   
    AND nspname !~ '^pg_toast'
    AND relname like 'tbl_disk%'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 20;

The result:

       relation        | total_size 
-----------------------+------------
 tbl_disk_20180830     | 90 MB
 tbl_disk_20180831     | 32 MB

Now I add a new column document with type tsvector:

ALTER TABLE tbl_disk ADD "document" tsvector;

and update the data:

UPDATE tbl_disk 
SET document = array_to_tsvector(ARRAY[split_part(path,'/',2),split_part(path,'/',6)]);

The path data looks like this: /data/user/0/com.abc.abc/database/olala.db

I only need the two tokens 'data' and 'database', which is why I split the path.
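
To sanity-check the expression, a quick SELECT over the sample path (the literal below is just the example path from above) shows what ends up in document:

SELECT array_to_tsvector(ARRAY[split_part(p, '/', 2), split_part(p, '/', 6)]) AS document
FROM (VALUES ('/data/user/0/com.abc.abc/database/olala.db')) AS t(p);

which returns 'data' 'database', as expected.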

After successfully updating, the size approximately doubles:

       relation        | total_size 
-----------------------+------------
 tbl_disk_20180830     | 175 MB
 tbl_disk_20180831     | 61 MB

The data:

                          path                          |      document      
--------------------------------------------------------+--------------------
 /data/user/0/com.abc.abc/lib/libabcqlite.so          | 'data' 'lib'
 /data/user/0/com.abc.abc/files/ddinfo2               | 'data' 'files'
 /data/user/0/com.abc.abc/lib/libabcsqlite.so         | 'data' 'lib'
 /data/user/0/com.abc.abc/lib/libabcsqlite.so         | 'data' 'lib'
 /data/user/0/com.abc.abc/shaders/b32                 | 'data' 'shaders'

You can see that the size of tbl_disk_20180830 increased from 90 MB to 175 MB.

I don't understand why it needs so much space just to store two tokens.

Best Answer

Databases cannot simply overwrite data "in place". If they could, they would lose atomicity and isolation, and probably much of their concurrency and durability as well.
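
Those retained old row versions show up directly in the statistics views. Assuming the table names from the question, a query like this (n_dead_tup is a column of the standard pg_stat_user_tables view) should report roughly one dead tuple per updated row until vacuum cleans them up:

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname LIKE 'tbl_disk%';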

When you do a bulk update of every row in a table in PostgreSQL, the resulting table contains a copy of both the old data and the new data. You can get rid of the old copies by running VACUUM FULL <table name>. Or you can run a regular VACUUM, but that will not shrink the space the table takes up; it will only mark the space as reusable for future work on that table.
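
As a sketch, using one of the table names from the question:

-- Rewrites the table into a new, compact file and returns the freed space to the
-- operating system; note that it holds an ACCESS EXCLUSIVE lock for the duration.
VACUUM FULL tbl_disk_20180830;

-- Or just mark the dead row versions as reusable without shrinking the file:
VACUUM tbl_disk_20180830;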