PostgreSQL – How to Perform Partial Dump of Large Objects (pg_largeobject Table)

dump, pg-dump, postgresql, postgresql-9.1, scripting

Context: Our web app is a collaboration platform where you can share files, appointments, surveys and so on. An instance of this web app is divided into communities, each independent of the others but sharing the same PostgreSQL database.

Target: What I'm trying to do is write a shell script to make regular backups of each community separately. This means I need to select only part of the data inside the various PostgreSQL tables and export/dump the community's data.

Obstacle:
My big problem is with the pg_largeobject table, used to store the BLOBs of files. This table uses OIDs as its primary key (something I don't really like, by the way). The files table (containing file metadata) has a reference to the pg_largeobject OID.

Solutions considered:

    • Select all the OIDs from the files table WHERE community_id = something.
    • Afterwards, I could export the data stream from pg_largeobject using the lo_export function, passing each OID selected in the previous step (a rough sketch follows below):

      int lo_export(PGconn *conn, Oid lobjId, const char *filename);
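
Roughly, I imagine this first step as a shell loop like the following sketch (the database name, community id and the files.file_oid column are placeholders, since I haven't shown the real schema):

    #!/bin/bash
    DB=mydb
    COMMUNITY_ID=42
    OUT_DIR=/backup/community_$COMMUNITY_ID
    mkdir -p "$OUT_DIR"

    # -A -t: unaligned, tuples-only output, one OID per line
    psql -At -d "$DB" \
         -c "SELECT file_oid FROM files WHERE community_id = $COMMUNITY_ID" |
    while read -r oid; do
        # \lo_export writes the large object's data stream to a file on the client machine
        psql -d "$DB" -c "\lo_export $oid $OUT_DIR/$oid"
    done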

The problem with this approach is that when I try to restore the pg_largeobject records, I need to keep the same OIDs, or the relation will be broken. I could use the lo_import_with_oid function, which lets me specify the desired OID, but in most cases that OID will already be taken by the original data, so the import will end up with a new OID that no longer matches the relation with the rest of the database.
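
For reference, the server-side flavour of lo_import also accepts an explicit OID (it reads the file from the server's filesystem and needs superuser rights); the path and OID here are made up:

    -- server-side counterpart of lo_import_with_oid: imports the file under the given OID
    SELECT lo_import('/backup/community_42/16405', 16405);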

  1. I still don't know how I could achieve this, but I'll try to explain:
    • during the export I could select and save all the PKs of the files table records of a given community (let's call them filesPKs);
    • I would then perform a lo_export of pg_largeobject;
    • during the lo_import I could ignore the fact that new OIDs are being generated, but somehow save the list of these newly inserted OIDs;
    • as a final step I could update all the rows (using the filesPKs reference) in the files table so that they point to the newly imported pg_largeobject records.

This second procedure seems very difficult to me, because I still don't see how I could update the files rows so that they point to the correct large objects.
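
For what it's worth, the final update would only be straightforward if I somehow had a mapping from each files PK to the OID generated by the import, something along these lines (oid_mapping is purely hypothetical; I don't see how to build it yet):

    -- hypothetical: oid_mapping(file_id, new_oid), filled in while importing
    UPDATE files f
       SET file_oid = m.new_oid
      FROM oid_mapping m
     WHERE f.file_id = m.file_id;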

  • Another option, which fancifully would be the best one, would be to alter the pg_largeobject table to add a column with a (static) reference to the files table. But after a test, it seems to be impossible:

    testdb=# alter table pg_largeobject add column prova text;
    ERROR:  permission denied: "pg_largeobject" is a system catalog

Questions:

  • Has anyone else faced this issue?

  • Am I just missing something, or taking the wrong approach?

  • Is there another way to export/dump pg_largeobject records, exporting all the fields and not only the data stream?

Best Answer

I managed to resolve this issue as follows (second considered solution, revisited):

All inside a while loop, for every file:

  • select file_id, file_version and oid, and put them in three variables;
  • perform a lo_export of pg_largeobject, which requires the oid and a path + filename. The filename is the concatenation of the file_id and file_version;
  • during lo_import a new OID is generated, but using the filename defined above I can perform the last step:
  • update all the rows in the files table so that they point to the newly imported pg_largeobject records and keep the relation consistent (a rough sketch of the whole loop follows below).
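
A minimal sketch of how that loop might look as a shell script, assuming the files table has numeric file_id and file_version columns plus file_oid and community_id (the real names and types may differ), and that neither value contains an underscore:

    #!/bin/bash
    DB=mydb
    COMMUNITY_ID=42
    DIR=/backup/community_$COMMUNITY_ID
    mkdir -p "$DIR"

    # Export: one file per large object, named <file_id>_<file_version>
    psql -At -d "$DB" \
         -c "SELECT file_id, file_version, file_oid FROM files WHERE community_id = $COMMUNITY_ID" |
    while IFS='|' read -r file_id file_version oid; do
        psql -d "$DB" -c "\lo_export $oid $DIR/${file_id}_${file_version}"
    done

    # Import (into the target database): \lo_import creates the large object
    # under a new OID and stores that OID in :LASTOID, which the UPDATE then
    # uses to repair the reference in the files table.
    for f in "$DIR"/*; do
        base=$(basename "$f")
        file_id=${base%_*}
        file_version=${base##*_}
        psql -d "$DB" <<SQL
    \lo_import '$f'
    UPDATE files SET file_oid = :LASTOID
     WHERE file_id = $file_id AND file_version = $file_version;
    SQL
    done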