Based on simulations with guessed data, I think you can slightly improve your query by:
- Avoiding the outer DISTINCT clause (although the IN subquery still implies a DISTINCT).
- Sub-selecting part of the data so that fewer rows need to be JOINed.
The query is as follows:
SELECT
    employees.*
FROM
    employees
WHERE
    employee_id IN
    (SELECT
        -- Choose all employees from companies with a website
        employee_id
    FROM
        employee_companies
        JOIN companies ON companies.company_id = employee_companies.company_id
    WHERE
        companies.website IS NOT NULL
    )
    -- Now keep only employees from 'Germany'
    AND employees.country = 'Germany'
ORDER BY
    employees.connections DESC ;
The simulation uses the following data.
Table and index definitions:
CREATE TABLE employees
(
    employee_id integer PRIMARY KEY,
    country text,
    connections integer,
    something_else text
) ;
CREATE INDEX idx_employee_country
    ON employees (country) ;
CREATE TABLE companies
(
    company_id integer PRIMARY KEY,
    website text,
    something_else text
) ;
CREATE INDEX not_empty_websites
    ON companies (company_id, website) WHERE website IS NOT NULL ;
CREATE TABLE employee_companies
(
    employee_id integer NOT NULL REFERENCES employees(employee_id),
    company_id integer NOT NULL REFERENCES companies(company_id),
    PRIMARY KEY (employee_id, company_id)
) ;
CREATE INDEX company_employee
    ON employee_companies (company_id, employee_id) ;
1,000,000 companies (changing to 10M doesn't make a big difference). I assume 90% have a website.
INSERT INTO
    companies
    (company_id, website)
SELECT
    generate_series(1, 1000000),
    CASE WHEN random() > 0.1 THEN 'web.com' END AS website ;
80k employees (about 5% are German: the ::integer cast rounds to the nearest integer, so 0 comes up half as often as each of 1 through 9)
INSERT INTO
    employees
    (employee_id, country, connections)
SELECT
    generate_series(1, 80000),
    case (random()*10)::integer
        when 0 then 'Germany'
        when 1 then 'United Kingdom'
        when 2 then 'United States'
        else 'Angola'
    end AS country,
    (random()*10)::integer AS connections ;
200K employee-company pairs (this means that people have worked in about 2.5 companies each, on average):
INSERT INTO
    employee_companies
    (employee_id, company_id)
SELECT DISTINCT
    (random()*79999)::integer + 1,
    (random()*999999)::integer + 1
FROM
    generate_series (1, 200000) ;
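Before timing anything, it may be worth checking that the generated data has roughly the intended shape. The queries below are only a sketch against the tables defined above; the column aliases (n_employees, percent, and so on) are made up for illustration and are not part of the original fiddle.

```sql
-- Share of employees per country
SELECT
    country,
    count(*) AS n_employees,
    round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS percent
FROM
    employees
GROUP BY
    country
ORDER BY
    n_employees DESC ;

-- Share of companies that have a website
-- (count(website) skips NULLs, count(*) does not)
SELECT
    round(100.0 * count(website) / count(*), 1) AS percent_with_website
FROM
    companies ;

-- Average number of companies per employee, over all employees
SELECT
    round(count(*)::numeric / (SELECT count(*) FROM employees), 2)
        AS avg_companies_per_employee
FROM
    employee_companies ;
```

If the proportions are far from those of your real tables, the timings that follow are unlikely to carry over.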
You can check a downsized version of this simulation at dbfiddle here. If this simulated data is sufficiently similar to your scenario, the rewritten query runs about 3x faster in terms of server execution time. I'd suggest you give it a try.
Simulating data closer to your real scenario (scaled down by a factor of 25) doesn't give such a nice increase in performance. Nevertheless, the query still runs about 1.5x faster.
Check it at this dbfiddle
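To reproduce the comparison on your own server, one option is to refresh the planner statistics after loading the data and then wrap each variant of the query in EXPLAIN (ANALYZE, BUFFERS). This is a sketch that assumes the table definitions above; run your original query the same way and compare the execution time reported at the end of each plan.

```sql
-- Refresh planner statistics after the bulk inserts
ANALYZE employees ;
ANALYZE companies ;
ANALYZE employee_companies ;

-- Executes the query for real and reports the plan, timings and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    employees.*
FROM
    employees
WHERE
    employee_id IN
    (SELECT
        employee_id
    FROM
        employee_companies
        JOIN companies ON companies.company_id = employee_companies.company_id
    WHERE
        companies.website IS NOT NULL
    )
    AND employees.country = 'Germany'
ORDER BY
    employees.connections DESC ;
```

Running each variant several times helps even out caching effects, since the first execution usually pays for reading pages from disk.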
Best Answer
No. That's because those enum types can be used in different tables and such. PostgreSQL doesn't have a dependency tracker to make this work. The only exception to this that I know is sequences, -t will recreate sequences.
That said, if you recreate the database once, you only ever have to run pg_dump -t employees -c, the -c flag will drop the table and recreate it. This will unfortunately not work with foreign key constraints pointing to employees.