I have a couple of tables that I need to join: an employees table (~400K rows), a companies table (~10 million rows), and an employee_companies table which stores where each employee works.
Basically, I need to get all the employees that match some conditions (they work at a company that has a website, they are located in a certain country, etc.). I wrote a query that does this, but it's taking too long and I need to speed it up.
SELECT DISTINCT "employees".*
FROM "employees"
INNER JOIN "employee_companies" ON "employee_companies"."employee_id" = "employees"."id"
INNER JOIN "companies" ON "companies"."id" = "employee_companies"."company_id"
WHERE (employee_companies.employee_id IS NOT NULL)
AND (companies.website IS NOT NULL)
AND (employees.country = 'Uruguay')
ORDER BY employees.connections DESC
This is the plan for that query:
Unique (cost=877170.24..880752.72 rows=62304 width=1064) (actual time=24023.736..26001.876 rows=73318 loops=1)
-> Sort (cost=877170.24..877326.00 rows=62304 width=1064) (actual time=24023.733..24305.989 rows=77579 loops=1)
Sort Key: employees.connections DESC, employees.id, employees.name, employees.link, employees.role, employees.area, employees.profile_picture, employees.summary, employees.current_companies, employees.previous_companies, employees.skills, employees.education, employees.languages, employees.volunteer, employees.groups, employees.interests, employees.search_vector, employees.secondary_search_vector, employees.email_status, employees.languages_count, employees.role_hierarchy
Sort Method: external merge Disk: 85816kB
-> Nested Loop (cost=2642.38..843246.15 rows=62304 width=1064) (actual time=139.870..23056.234 rows=77579 loops=1)
-> Hash Join (cost=2641.95..221744.50 rows=77860 width=1068) (actual time=139.841..22617.587 rows=77579 loops=1)
Hash Cond: (employees.id = employee_companies.employee_id)
-> Seq Scan on employees (cost=0.00..212178.88 rows=409672 width=1064) (actual time=8.145..22369.166 rows=393725 loops=1)
Filter: ((country)::text = 'Uruguay'::text)
Rows Removed by Filter: 1075
-> Hash (cost=1666.42..1666.42 rows=78042 width=8) (actual time=44.675..44.675 rows=78042 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 4073kB
-> Seq Scan on employee_companies (cost=0.00..1666.42 rows=78042 width=8) (actual time=0.007..22.901 rows=78042 loops=1)
Filter: (employee_id IS NOT NULL)
-> Index Scan using companies_pkey on companies (cost=0.43..7.97 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=77579)
Index Cond: (id = employee_companies.company_id)
Filter: (website IS NOT NULL)
Planning time: 1.957 ms
Execution time: 26025.045 ms
And these are the relevant indexes that I have on my tables:
employees:
"employees_pkey" PRIMARY KEY, btree (id)
"ix_employees_country" btree (country)
companies:
"companies_pkey" PRIMARY KEY, btree (id)
"empty_websites" btree (website) WHERE website IS NULL
"index_companies_on_website" btree (website)
"not_empty_websites" btree (website) WHERE website IS NOT NULL
employee_companies:
"employee_companies_pkey" PRIMARY KEY, btree (id)
"index_employee_companies_on_company_id" btree (company_id)
"index_employee_companies_on_employee_id" btree (employee_id)
"index_employee_companies_on_employee_id_and_company_id" btree (employee_id, company_id)
"not_empty_employee_id" btree (employee_id) WHERE employee_id IS NOT NULL
Is there a better, more efficient way to do what I want?
Thanks!
Best Answer
Based on some guess-simulations, I think you can slightly improve your query by:

1. Getting rid of the DISTINCT clause (although there will implicitly be a DISTINCT, via the semi-join below).
2. Rewriting the JOINs as an EXISTS subquery.

The query is as follows:
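Something along these lines (a sketch, reusing the table and column names from your question; adapt as needed):

SELECT employees.*
FROM employees
WHERE employees.country = 'Uruguay'
  AND EXISTS (
        SELECT 1
        FROM employee_companies
        INNER JOIN companies
                ON companies.id = employee_companies.company_id
        WHERE employee_companies.employee_id = employees.id
          AND companies.website IS NOT NULL
      )
ORDER BY employees.connections DESC;

The EXISTS acts as a semi-join: each employee row is returned at most once, no matter how many matching companies there are, which is what makes the DISTINCT redundant.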
The data used to produce the simulation is as follows:
Table and index definitions:
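A trimmed-down sketch of what I used (only the columns the query touches, plus the indexes you listed; the exact DDL is an assumption):

CREATE TABLE companies (
    id      serial PRIMARY KEY,
    website text              -- NULL when the company has no website
);

CREATE TABLE employees (
    id          serial PRIMARY KEY,
    country     text,
    connections integer
    -- your real table has many more (wide) columns
);

CREATE TABLE employee_companies (
    id          serial PRIMARY KEY,
    employee_id integer REFERENCES employees (id),
    company_id  integer REFERENCES companies (id)
);

CREATE INDEX ix_employees_country ON employees (country);
CREATE INDEX index_employee_companies_on_employee_id ON employee_companies (employee_id);
CREATE INDEX index_employee_companies_on_company_id ON employee_companies (company_id);
CREATE INDEX index_employee_companies_on_employee_id_and_company_id
    ON employee_companies (employee_id, company_id);
CREATE INDEX not_empty_websites ON companies (website) WHERE website IS NOT NULL;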
1,000,000 companies (changing to 10M doesn't make a big difference). I assume 90% have a website.
80K employees (about 10% are German).
200K employee-company links (this means that people have worked at about 3 companies, on average):
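Populated with generate_series, roughly like this (a sketch; the thresholds and the hypothetical URLs only exist to hit the percentages above):

INSERT INTO companies (website)
SELECT CASE WHEN random() < 0.9                      -- ~90% have a website
            THEN 'https://example-' || i || '.com'   -- hypothetical URL
       END
FROM generate_series(1, 1000000) AS i;

INSERT INTO employees (country, connections)
SELECT CASE WHEN random() < 0.1 THEN 'Germany' ELSE 'Other' END,
       (random() * 500)::integer
FROM generate_series(1, 80000) AS i;

INSERT INTO employee_companies (employee_id, company_id)
SELECT (random() * 79999)::integer + 1,              -- random existing employee
       (random() * 999999)::integer + 1              -- random existing company
FROM generate_series(1, 200000) AS i;                -- ~2.5 companies per employee on average

Duplicate (employee_id, company_id) pairs are possible here, which is fine for this test (and is exactly why the original query needed DISTINCT).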
You can check a downsized version of this simulation at dbfiddle here. If this simulated data is sufficiently similar to your scenario, the rewritten query gives about a 3x improvement in server-side execution time. I'd suggest you give it a try.
Simulating a scenario more similar to your real one (with data scaled down by a factor of 25) doesn't offer such a nice increase in performance; nevertheless, it still improves by a factor of about 1.5.
Check it at this dbfiddle.