Mysql – Left join not working like expected

join;MySQL

I need to do a table join to find missing translations in a translation system.

Let's say we have the following simplified table (id is the identifier of the string, the actual translation is not included in the example for simplicity's sake):

+------+-------+
| lang | id    |
+------+-------+
| en   | hello |
| en   | world |
| fr   | hello |
| de   | world |
+------+-------+

My initial thought was to do a simple left join:

select a.*, b.* from trans a left join trans b using (id) where a.lang = 'en';

I thought I would get this (so I could add where b.id is null to find the missing translation):

+------+-------+------+-------+
| lang | id    | lang | id    |
+------+-------+------+-------+
| en   | hello | en   | hello |
| en   | world | en   | world |
| en   | hello | fr   | hello |
| en   | world | NULL | NULL  |
| en   | hello | NULL | NULL  |
| en   | world | de   | NULL  |
+------+-------+------+-------+

What I actually get is this:

+------+-------+------+-------+
| lang | id    | lang | id    |
+------+-------+------+-------+
| en   | hello | en   | hello |
| en   | world | en   | world |
| en   | hello | fr   | hello |
| en   | world | de   | world |
+------+-------+------+-------+

What mistake have I made in my way of thinking, and what is the correct query to get the desired output?

Best Answer

If you run the statement without the where clause you'll see why:

lang | id    | lang | id   
-----+-------+------+------
en   | hello | en   | hello
en   | hello | fr   | hello
fr   | hello | en   | hello
fr   | hello | fr   | hello
en   | world | en   | world
en   | world | de   | world
de   | world | en   | world
de   | world | de   | world

The join on the "id" column works like this:

Take the first hello from the table and look for all rows that contain hello - that yield two rows for the first hello. The same happens with the second hello, so you wind up with 2x2 rows for the join on hello. And the same for world

The outer join does not play any role, because there is a match for each id (actually: two matches).

You can never get your first (intended) result because that implies that all rows in the "a" table have lang='en' (which is of course not true).

To get the missing translations you need to first create the combination of all languages and ids:

select distinct a.id, b.lang
from trans a
  cross join trans b;

Now you need to find all rows that are not in that result:

select *
from ( 
  select distinct a.id, b.lang
  from trans a
    cross join trans b
) ac
where not exists (select 1
                  from trans mt
                  where mt.id = ac.id 
                    and mt.lang = ac.lang);

You can achieve this with an outer join as well. I simply prefer the not exists because it documents more clearly the intention (and because I hardly ever work with MySQL which is known to perform poorly with sub-queries like that)

select ac.*
from ( 
  select distinct a.id, b.lang
  from trans a
    cross join trans b
) ac
  left join trans mt on mt.id = ac.id and mt.lang = ac.lang
where mt.id is null;

Here is an SQLFiddle: http://sqlfiddle.com/#!2/9804d/6

Edit

after testing the peformance with larger tables, it seems that Sean's version of the cross join is much more efficient than mine.

So this statement should be faster than the ones above:

select at.*
from (
    select lang_code, label_code
    from (
        SELECT distinct lang_code 
        FROM translations
    ) as translang
      cross join (
         SELECT distinct label_code 
         FROM translations
      ) as transid
) at
  left join translations mt 
         on mt.lang_code = at.lang_code
        and mt.label_code = at.label_code
where mt.lang_code is null;

Edit 2

And another version to be tested (SQL-Fiddle):

SELECT a.lang, a.id,
       l.lang AS blang
FROM trans a 
  CROSS JOIN 
    ( SELECT DISTINCT lang
      FROM trans
    ) l
  LEFT JOIN trans b 
    ON  b.id = a.id
    AND b.lang = l.lang 
WHERE a.lang = 'en'
  AND b.id IS NULL ;