Mysql – Left join not working like expected

join;MySQL

I need to do a table join to find missing translations in a translation system.

Let's say we have the following simplified table (id is the identifier of the string, the actual translation is not included in the example for simplicity's sake):

+------+-------+
| lang | id    |
+------+-------+
| en   | hello |
| en   | world |
| fr   | hello |
| de   | world |
+------+-------+

My initial thought was to do a simple left join:

select a.*, b.* from trans a left join trans b using (id) where a.lang = 'en';

I thought I would get this (so I could add where b.id is null to find the missing translation):

+------+-------+------+-------+
| lang | id    | lang | id    |
+------+-------+------+-------+
| en   | hello | en   | hello |
| en   | world | en   | world |
| en   | hello | fr   | hello |
| en   | world | NULL | NULL  |
| en   | hello | NULL | NULL  |
| en   | world | de   | NULL  |
+------+-------+------+-------+

What I actually get is this:

+------+-------+------+-------+
| lang | id    | lang | id    |
+------+-------+------+-------+
| en   | hello | en   | hello |
| en   | world | en   | world |
| en   | hello | fr   | hello |
| en   | world | de   | world |
+------+-------+------+-------+

What mistake have I made in my way of thinking, and what is the correct query to get the desired output?

Best Answer

If you run the statement without the where clause you'll see why:

lang | id    | lang | id   
-----+-------+------+------
en   | hello | en   | hello
en   | hello | fr   | hello
fr   | hello | en   | hello
fr   | hello | fr   | hello
en   | world | en   | world
en   | world | de   | world
de   | world | en   | world
de   | world | de   | world

The join on the "id" column works like this:

Take the first hello from the table and look for all rows that contain hello: that yields two rows for the first hello. The same happens with the second hello, so you wind up with 2x2 rows for the join on hello. The same applies to world.

The outer join does not play any role, because there is a match for each id (actually: two matches).

You can never get your first (intended) result because that implies that all rows in the "a" table have lang='en' (which is of course not true).
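To see the row counts for yourself, here is a minimal sketch that runs the self join against the sample data. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and it spells the join condition out with ON instead of USING so that a.id and b.id stay separate columns; the variable names are only for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (lang TEXT, id TEXT)")
conn.executemany(
    "INSERT INTO trans VALUES (?, ?)",
    [("en", "hello"), ("en", "world"), ("fr", "hello"), ("de", "world")],
)

# Self join on id only: every row pairs with every row sharing its id.
all_pairs = conn.execute(
    "SELECT a.lang, a.id, b.lang, b.id "
    "FROM trans a LEFT JOIN trans b ON b.id = a.id"
).fetchall()

# The same join, restricted to the English rows on the left side.
en_pairs = conn.execute(
    "SELECT a.lang, a.id, b.lang, b.id "
    "FROM trans a LEFT JOIN trans b ON b.id = a.id "
    "WHERE a.lang = 'en'"
).fetchall()

# Rows where the right side is NULL, i.e. where the outer join had no match.
unmatched = [row for row in en_pairs if row[2] is None]

print(len(all_pairs), len(en_pairs), len(unmatched))  # 8 4 0
```

Every id matches at least itself, so the right side is never NULL and a filter on b.id is null can never return anything.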

To get the missing translations you need to first create the combination of all languages and ids:

select distinct a.id, b.lang
from trans a
  cross join trans b;

Now you need to find all rows that are not in that result:

select *
from ( 
  select distinct a.id, b.lang
  from trans a
    cross join trans b
) ac
where not exists (select 1
                  from trans mt
                  where mt.id = ac.id 
                    and mt.lang = ac.lang);

You can achieve this with an outer join as well. I simply prefer NOT EXISTS because it documents the intention more clearly (and because I hardly ever work with MySQL, which is known to perform poorly with sub-queries like that):

select ac.*
from ( 
  select distinct a.id, b.lang
  from trans a
    cross join trans b
) ac
  left join trans mt on mt.id = ac.id and mt.lang = ac.lang
where mt.id is null;
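As a quick check that the NOT EXISTS form and the outer join form agree, the following sketch runs both against the sample data. It assumes Python's sqlite3 module in place of MySQL, adds explicit column aliases and an ORDER BY for a stable result, and uses illustrative variable names.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (lang TEXT, id TEXT)")
conn.executemany(
    "INSERT INTO trans VALUES (?, ?)",
    [("en", "hello"), ("en", "world"), ("fr", "hello"), ("de", "world")],
)

# Every id combined with every language, whether or not it is translated.
combos = """
    SELECT DISTINCT a.id AS id, b.lang AS lang
    FROM trans a CROSS JOIN trans b
"""

# Variant 1: keep the combinations that have no matching row in trans.
missing_not_exists = conn.execute(f"""
    SELECT ac.id, ac.lang
    FROM ({combos}) ac
    WHERE NOT EXISTS (SELECT 1
                      FROM trans mt
                      WHERE mt.id = ac.id AND mt.lang = ac.lang)
    ORDER BY ac.id, ac.lang
""").fetchall()

# Variant 2: the same anti-join written as LEFT JOIN ... IS NULL.
missing_left_join = conn.execute(f"""
    SELECT ac.id, ac.lang
    FROM ({combos}) ac
      LEFT JOIN trans mt ON mt.id = ac.id AND mt.lang = ac.lang
    WHERE mt.id IS NULL
    ORDER BY ac.id, ac.lang
""").fetchall()

print(missing_not_exists)  # [('hello', 'de'), ('world', 'fr')]
print(missing_left_join == missing_not_exists)  # True
```

Six combinations exist (two ids times three languages) and four of them are present in the table, which leaves the two missing translations shown.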

Here is an SQLFiddle: http://sqlfiddle.com/#!2/9804d/6

Edit

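After testing the performance with larger tables, it seems that Sean's version of the cross join is much more efficient than mine.

So this statement should be faster than the ones above (note that it is written against a table translations with columns lang_code and label_code rather than the simplified trans table):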

select at.*
from (
    select lang_code, label_code
    from (
        SELECT distinct lang_code 
        FROM translations
    ) as translang
      cross join (
         SELECT distinct label_code 
         FROM translations
      ) as transid
) at
  left join translations mt 
         on mt.lang_code = at.lang_code
        and mt.label_code = at.label_code
where mt.lang_code is null;

Edit 2

And another version to be tested (SQL-Fiddle):

SELECT a.lang, a.id,
       l.lang AS blang
FROM trans a 
  CROSS JOIN 
    ( SELECT DISTINCT lang
      FROM trans
    ) l
  LEFT JOIN trans b 
    ON  b.id = a.id
    AND b.lang = l.lang 
WHERE a.lang = 'en'
  AND b.id IS NULL ;
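For reference, this is roughly how that last query behaves on the sample data from the question. The sketch assumes Python's sqlite3 module as a substitute for MySQL and adds an ORDER BY so the row order is stable; nothing else in the query is changed.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (lang TEXT, id TEXT)")
conn.executemany(
    "INSERT INTO trans VALUES (?, ?)",
    [("en", "hello"), ("en", "world"), ("fr", "hello"), ("de", "world")],
)

# Pair each English string with every language, then keep the pairs
# for which no translation row exists.
missing_for_en = conn.execute("""
    SELECT a.lang, a.id, l.lang AS blang
    FROM trans a
      CROSS JOIN (SELECT DISTINCT lang FROM trans) l
      LEFT JOIN trans b
        ON  b.id = a.id
        AND b.lang = l.lang
    WHERE a.lang = 'en'
      AND b.id IS NULL
    ORDER BY a.id
""").fetchall()

print(missing_for_en)  # [('en', 'hello', 'de'), ('en', 'world', 'fr')]
```

Because the left side is limited to the English rows, this only reports strings that exist in English and lack a translation in some other language.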

Related Solutions

Sql-server – Why is this Full Outer Join not working

You should do an isnull on both policynumbers so that you can group properly.

Since it's an outer-join, there's the possibility of one side of the join being NULL while still having data.

select sum(premium) Prem_Sum, sum(Loss) Loss_Sum, isnull(t1.policynumber, t2.policynumber) policynumber
from t1 full outer join t2 on t1.policynumber = t2.policynumber
group by isnull(t1.policynumber, t2.policynumber)

Mysql – How to simulate a table join on itself

You'll need to first create a list of every product_number and date combination. You can do this using a CROSS JOIN of your table:

select distinct p.product_number, d.date
from yourtable p
cross join yourtable d;

See SQL Fiddle with Demo. This will create a list of data similar to:

| PRODUCT_NUMBER |                            DATE |
|----------------|---------------------------------|
|            100 |  January, 01 2010 00:00:00+0000 |
|            200 |  January, 01 2010 00:00:00+0000 |
|            100 | February, 01 2010 00:00:00+0000 |

You will then use the above query and LEFT JOIN to your table to return the final result:

select 
  pd.product_number,
  pd.date,
  t.value
from
(
  -- list of every product/date
  select distinct p.product_number, d.date
  from yourtable p
  cross join yourtable d
) pd
left join yourtable t
  on pd.date = t.date
  and pd.product_number = t.product_number
order by pd.product_number, pd.date;

See SQL Fiddle with Demo. Giving a final result of:

| PRODUCT_NUMBER |                            DATE |  VALUE |
|----------------|---------------------------------|--------|
|            100 |  January, 01 2010 00:00:00+0000 |      1 |
|            100 | February, 01 2010 00:00:00+0000 |      1 |
|            100 |    March, 01 2010 00:00:00+0000 |      1 |
|            200 |  January, 01 2010 00:00:00+0000 |      1 |
|            200 | February, 01 2010 00:00:00+0000 | (null) |
|            200 |    March, 01 2010 00:00:00+0000 |      1 |

The LEFT JOIN returns all rows from your list of products and dates regardless of whether a matching row exists in the other table.

This could also be written as:

select 
  p.product_number,
  d.date,
  t.value
from
(
  -- list of every product
  select distinct product_number
  from yourtable 
) p  
cross join
( 
  -- list of every date
  select distinct date
  from yourtable 
) d
  -- then join to the table
left join yourtable t
  on  d.date = t.date
  and p.product_number = t.product_number
order by p.product_number, d.date ;

See SQL Fiddle with Demo. This may have better performance depending on your table size.
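Here is a small sketch of that second form that can be run locally. The rows inserted into yourtable are an assumption reconstructed from the result table above (the original data is not shown), dates are stored as plain 'YYYY-MM-DD' text, and Python's sqlite3 module stands in for MySQL.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (product_number INTEGER, date TEXT, value INTEGER)")
# Sample rows inferred from the result table; product 200 has no February row.
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        (100, "2010-01-01", 1),
        (100, "2010-02-01", 1),
        (100, "2010-03-01", 1),
        (200, "2010-01-01", 1),
        (200, "2010-03-01", 1),
    ],
)

# Distinct products x distinct dates, then LEFT JOIN back to the table.
rows = conn.execute("""
    SELECT p.product_number, d.date, t.value
    FROM (SELECT DISTINCT product_number FROM yourtable) p
    CROSS JOIN (SELECT DISTINCT date FROM yourtable) d
    LEFT JOIN yourtable t
      ON  d.date = t.date
      AND p.product_number = t.product_number
    ORDER BY p.product_number, d.date
""").fetchall()

# Combinations with no matching row come back with value NULL (None here).
gaps = [(product, date) for product, date, value in rows if value is None]

print(len(rows))  # 6
print(gaps)       # [(200, '2010-02-01')]
```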

Now if you wanted to return a list of all dates, regardless of whether or not they appear in the table, then I would suggest creating a table of dates. This table would be used in a similar manner to create a list of all dates/products which you would then join.

The table would be similar to:

CREATE TABLE dates
    (`date` datetime)
;

INSERT INTO dates
    (`date`)
VALUES
    ('2010-01-01 00:00:00'),
    ('2010-02-01 00:00:00'),
    ('2010-03-01 00:00:00'),
    ('2010-04-01 00:00:00'),
    ('2010-05-01 00:00:00')
;

You'd then use the following query to get the list of dates/products:

select distinct p.product_number, d.date
from yourtable p
cross join dates d;

And finally, you would join that back to your table:

select 
  pd.product_number,
  pd.date,
  t.value
from
(
  -- list of every product/date
  select distinct p.product_number, d.date
  from yourtable p
  cross join dates d
) pd
left join yourtable t
  on pd.date = t.date
  and pd.product_number = t.product_number
order by pd.product_number, pd.date;

See SQL Fiddle with Demo. Or an alternative:

select 
  p.product_number,
  d.date,
  t.value
from
(
  -- list of every product
  select distinct product_number
  from yourtable 
) p  
cross join dates d
  -- then join to the table
left join yourtable t
  on  d.date = t.date
  and p.product_number = t.product_number
order by p.product_number, d.date ;

See SQL Fiddle with Demo. Again, this may have better performance depending on the table size. Using this type of solution, you'd return all dates, even those not in your table:

| PRODUCT_NUMBER |                            DATE |  VALUE |
|----------------|---------------------------------|--------|
|            100 |  January, 01 2010 00:00:00+0000 |      1 |
|            100 | February, 01 2010 00:00:00+0000 |      1 |
|            100 |    March, 01 2010 00:00:00+0000 |      1 |
|            100 |    April, 01 2010 00:00:00+0000 | (null) |
|            100 |      May, 01 2010 00:00:00+0000 | (null) |
|            200 |  January, 01 2010 00:00:00+0000 |      1 |
|            200 | February, 01 2010 00:00:00+0000 | (null) |
|            200 |    March, 01 2010 00:00:00+0000 |      1 |
|            200 |    April, 01 2010 00:00:00+0000 | (null) |
|            200 |      May, 01 2010 00:00:00+0000 | (null) |
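The calendar-table variant can be tried out the same way. In this sketch the contents of yourtable are an assumption reconstructed from the results shown, dates are stored as 'YYYY-MM-DD' text, and Python's sqlite3 module is used in place of MySQL.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (product_number INTEGER, date TEXT, value INTEGER)")
conn.execute("CREATE TABLE dates (date TEXT)")
# Sample rows inferred from the result table above.
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        (100, "2010-01-01", 1),
        (100, "2010-02-01", 1),
        (100, "2010-03-01", 1),
        (200, "2010-01-01", 1),
        (200, "2010-03-01", 1),
    ],
)
# Calendar table covering January through May 2010.
conn.executemany(
    "INSERT INTO dates VALUES (?)",
    [(f"2010-{month:02d}-01",) for month in range(1, 6)],
)

# Distinct products x all calendar dates, then LEFT JOIN back to the data.
rows = conn.execute("""
    SELECT p.product_number, d.date, t.value
    FROM (SELECT DISTINCT product_number FROM yourtable) p
    CROSS JOIN dates d
    LEFT JOIN yourtable t
      ON  d.date = t.date
      AND p.product_number = t.product_number
    ORDER BY p.product_number, d.date
""").fetchall()

# Product/date pairs that have no row in yourtable.
gaps = [(product, date) for product, date, value in rows if value is None]

print(len(rows), len(gaps))  # 10 5
```

April and May show up for both products even though neither month appears in yourtable, because the list of dates now comes from the dates table.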
