PostgreSQL pg_dump – Table and Transitive Dependencies

Sometimes I'd like to export just a single table and its dependencies, so I can do some experiments with it on a test database. Is there a way to do that, other than dumping the full database schema with pg_dump and picking out the table's dependencies manually?

For example, let's say I have a table employees which has a single foreign key to companies.id and uses a few custom enums or other types. Let's also assume companies is self-contained, i.e. does not depend on anything else. What I'd like is some way of dumping employees, both the schema and all (or a subset) of the data, and automatically getting the required type definitions, the schema for companies, and all referenced entries.

Is there a way to do this easily with pg_dump or some other tool?

Best Answer

"uses a few custom enums or other types."

No. Those enum types can be shared by several tables, and pg_dump -t makes no attempt to dump the other objects that the selected table depends on. The only exception that I know of is sequences: -t will recreate the table's sequences.

That said, if you recreate the database once, from then on you only ever have to run pg_dump -t employees -c; the -c flag drops the table and recreates it. Unfortunately, this will not work if there are foreign key constraints pointing to employees.
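One way to approximate the missing dependency handling is to work out the foreign-key closure yourself and pass every table to pg_dump with repeated --table flags. The sketch below is only an illustration: FOREIGN_KEYS, tables_to_dump and pg_dump_args are hypothetical names, the foreign-key map is written by hand (in practice it could be read from the catalogs), and the command is built but not run. Custom types such as enums are still not covered and would have to come from a schema dump.

```python
# Hypothetical foreign-key map: table -> tables it references.
FOREIGN_KEYS = {
    "employees": ["companies"],
    "companies": [],
}


def tables_to_dump(root, foreign_keys):
    """Return root plus every table reachable through foreign keys,
    listing referenced tables before the tables that reference them."""
    ordered = []
    seen = set()

    def visit(table):
        if table in seen:  # also guards against circular references
            return
        seen.add(table)
        for parent in foreign_keys.get(table, []):
            visit(parent)
        ordered.append(table)

    visit(root)
    return ordered


def pg_dump_args(dbname, root, foreign_keys):
    """Build (but do not run) a pg_dump command for root and its FK closure."""
    args = ["pg_dump", "--dbname", dbname]
    for table in tables_to_dump(root, foreign_keys):
        args += ["--table", table]
    return args


print(" ".join(pg_dump_args("mydb", "employees", FOREIGN_KEYS)))
```

The database name mydb is a placeholder. The resulting argument list could be handed to subprocess.run once the table list looks right.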

Related Solutions

Postgresql – in postgres, can one optimize a table across partitions

The question is whether there is any way to optimize at the global level. For example, would it be possible to create an index on global.usr_id that would span all partitions?

Sadly the answer is "No" - global indexes are not currently supported.

PostgreSQL – How to Speed Up Query with Joins

Based on some simulations with guessed data, I think you can slightly improve your query by:

Avoiding the outer DISTINCT clause (the IN subquery already implies a DISTINCT).
Sub-selecting a part of the data so that less is needed to JOIN.

The query is as follows:

SELECT  
    employees.* 
FROM 
    employees 
WHERE
    employee_id IN
    (SELECT 
        -- Choose all employees from companies with website
        employee_id 
     FROM 
        employee_companies
        JOIN companies ON companies.company_id = employee_companies.company_id
     WHERE
        companies.website IS NOT NULL
    )
    -- Now filter only employees from 'Germany'
    AND employees.country = 'Germany' 
ORDER BY 
    employees.connections DESC ;
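To see why the IN form needs no outer DISTINCT, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. That substitution is an assumption made only to keep the example self-contained, and the toy rows are invented. An employee linked to two companies with websites appears twice in a plain JOIN but only once with IN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, country TEXT, connections INTEGER);
    CREATE TABLE companies (company_id INTEGER PRIMARY KEY, website TEXT);
    CREATE TABLE employee_companies (employee_id INTEGER, company_id INTEGER,
                                     PRIMARY KEY (employee_id, company_id));
    -- Invented toy data: employee 1 works for two companies that have a website.
    INSERT INTO employees VALUES (1, 'Germany', 5), (2, 'Germany', 9), (3, 'Angola', 7);
    INSERT INTO companies VALUES (10, 'web.com'), (11, 'web.com'), (12, NULL);
    INSERT INTO employee_companies VALUES (1, 10), (1, 11), (2, 12), (3, 10);
""")

# Plain JOIN: one output row per matching (employee, company) pair.
join_rows = conn.execute("""
    SELECT e.employee_id
    FROM employees e
    JOIN employee_companies ec ON ec.employee_id = e.employee_id
    JOIN companies c ON c.company_id = ec.company_id
    WHERE c.website IS NOT NULL AND e.country = 'Germany'
    ORDER BY e.connections DESC
""").fetchall()

# IN subquery: each employee row is tested once, so no duplicates arise.
in_rows = conn.execute("""
    SELECT employee_id
    FROM employees
    WHERE employee_id IN (
        SELECT ec.employee_id
        FROM employee_companies ec
        JOIN companies c ON c.company_id = ec.company_id
        WHERE c.website IS NOT NULL
    )
    AND country = 'Germany'
    ORDER BY connections DESC
""").fetchall()

print(join_rows)  # employee 1 listed twice
print(in_rows)    # employee 1 listed once
```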

The simulation uses the following data.

Table and index definitions:

CREATE TABLE employees
(
    employee_id integer PRIMARY KEY,
    country text,
    connections integer,
    something_else text
) ;

CREATE INDEX idx_employee_country 
   ON employees (country) ;

CREATE TABLE companies
(
    company_id integer PRIMARY KEY,
    website text,
    something_else text
) ;

CREATE INDEX not_empty_websites 
    ON companies(company_id, website) WHERE website IS NOT NULL ;

CREATE TABLE employee_companies
(
    employee_id integer NOT NULL REFERENCES employees(employee_id),
    company_id integer NOT NULL REFERENCES companies(company_id),
    PRIMARY KEY (employee_id, company_id)
) ;

CREATE INDEX company_employee
    ON employee_companies(company_id, employee_id) ;

1,000,000 companies (changing to 10M doesn't make a big difference). I assume 90% have a website.

INSERT INTO 
   companies
   (company_id, website)
SELECT
   generate_series(1, 1000000), 
   CASE WHEN random() > 0.1 THEN 'web.com' END AS website ;

80k employees (about 5% are German, because the integer cast rounds to the nearest value)

INSERT INTO
   employees 
   (employee_id, country, connections)
SELECT
    generate_series(1, 80000),
    case (random()*10)::integer
    when 0 then 'Germany'
    when 1 then 'United Kingdom'
    when 2 then 'United States'
    else 'Angola'
    end AS country,
    (random()*10)::integer AS connections ;
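The country shares follow from how the cast behaves: (random()*10)::integer rounds to the nearest integer rather than truncating, so the values 0 and 10 each come up about half as often as 1 through 9. The sketch below checks this with Python's round() standing in for the PostgreSQL cast (an assumption for illustration; the sample size and seed are arbitrary).

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000

# round() stands in for the float-to-integer cast, which also rounds to nearest.
counts = Counter(round(random.random() * 10) for _ in range(n))

share_germany = counts[0] / n  # bucket 0 -> 'Germany'
share_uk = counts[1] / n       # bucket 1 -> 'United Kingdom'
print(f"Germany: {share_germany:.3f}, United Kingdom: {share_uk:.3f}")
```

If an even 10% per bucket is wanted, truncating with floor(random()*10) before the cast would be one way to get it.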

200K employee-company pairs (this means that people have worked in about 2.5 companies, on average):

INSERT INTO 
    employee_companies
    (employee_id, company_id)
SELECT DISTINCT
    (random()*79999)::integer + 1,
    (random()*999999)::integer + 1
FROM
    generate_series (1, 200000) ;

A downsized version of this simulation is available on dbfiddle. If this simulated data is sufficiently similar to your scenario, changing the query gives roughly a 3x improvement in server execution time. I'd suggest you give it a try.

Simulating data for a scenario closer to your real one (scaled down by a factor of 25) doesn't give such a nice increase in performance. Nevertheless, the query still improves by a factor of 1.5.

Check it at this dbfiddle
