Mysql – Duplicates while doing multiple joins

duplicationgroup byjoin;MySQL

I have 4 MySQL tables

1. companies - {id*, company}
2. employees - {id*, fname, lname, id_company*}
3. technologies - {id*, technology}
4. company_technologies - {id_technology*, id_company*}

Field is indexed if it has '*'.

Companies can use multiple technologies and have multiple employees.

What I'm trying to do is write a query that would select all employees and their companies that use ANY of given technologies. The problem is that I have duplicates when I try to do something like this:

SELECT employees.*, companies.* FROM companies LEFT JOIN employees on companies.id = employees.id_company LEFT JOIN company_technologies ON companies.id = company_technologies.id_company WHERE company_technologies.id_technology IN(1,2,3)

If, for example, a company is using technology 1 and 2 – it will appear twice along with all people.

If I add 'GROUP BY companies.id' condition at the end – I only have one employee for each company, whereas I need all of them.

Tables 1, 2 and 4 have several million records, all tables are InnoDB.

Best Answer

You can convert the join to a semi-join, using either IN or EXISTS:

SELECT employees.*, companies.* 
FROM companies 
  LEFT JOIN employees 
    ON companies.id = employees.id_company 
WHERE EXISTS
      ( SELECT 1
        FROM company_technologies 
        WHERE companies.id = company_technologies.id_company 
          AND company_technologies.id_technology IN (1,2,3)
      ) ;

Another way would be to use GROUP BY or DISTINCT in a subquery and then join:

SELECT e.*, c.* 
FROM companies AS c
  LEFT JOIN employees AS e
    ON c.id = e.id_company 
  JOIN
      ( SELECT id_company
        FROM company_technologies
        WHERE id_technology IN (1,2,3)
        GROUP BY id_company
      ) AS ct
    ON c.id = ct.id_company ;

Once you have one or more queries working correctly, you can start worrying about efficiency. I would add 2 composite indexes on the junction table, one on (id_technology,id_company) and one on (id_company,id_technology) (and remove the single indexes.)

Then test all queries with various parameters and table sizes and check which indexes are used and what are the execution plans.