If you run the statement without the where
clause you'll see why:
lang | id | lang | id
-----+-------+------+------
en | hello | en | hello
en | hello | fr | hello
fr | hello | en | hello
fr | hello | fr | hello
en | world | en | world
en | world | de | world
de | world | en | world
de | world | de | world
The join on the "id" column works like this:
Take the first hello
from the table and look for all rows that contain hello
- that yield two rows for the first hello. The same happens with the second hello, so you wind up with 2x2 rows for the join on hello. And the same for world
The outer join does not play any role, because there is a match for each id (actually: two matches).
You can never get your first (intended) result because that implies that all rows in the "a" table have lang='en' (which is of course not true).
To get the missing translations you need to first create the combination of all languages and ids:
select distinct a.id, b.lang
from trans a
cross join trans b;
Now you need to find all rows that are not in that result:
select *
from (
select distinct a.id, b.lang
from trans a
cross join trans b
) ac
where not exists (select 1
from trans mt
where mt.id = ac.id
and mt.lang = ac.lang);
You can achieve this with an outer join as well. I simply prefer the not exists
because it documents more clearly the intention (and because I hardly ever work with MySQL which is known to perform poorly with sub-queries like that)
select ac.*
from (
select distinct a.id, b.lang
from trans a
cross join trans b
) ac
left join trans mt on mt.id = ac.id and mt.lang = ac.lang
where mt.id is null;
Here is an SQLFiddle: http://sqlfiddle.com/#!2/9804d/6
Edit
after testing the peformance with larger tables, it seems that Sean's version of the cross join is much more efficient than mine.
So this statement should be faster than the ones above:
select at.*
from (
select lang_code, label_code
from (
SELECT distinct lang_code
FROM translations
) as translang
cross join (
SELECT distinct label_code
FROM translations
) as transid
) at
left join translations mt
on mt.lang_code = at.lang_code
and mt.label_code = at.label_code
where mt.lang_code is null;
Edit 2
And another version to be tested (SQL-Fiddle):
SELECT a.lang, a.id,
l.lang AS blang
FROM trans a
CROSS JOIN
( SELECT DISTINCT lang
FROM trans
) l
LEFT JOIN trans b
ON b.id = a.id
AND b.lang = l.lang
WHERE a.lang = 'en'
AND b.id IS NULL ;
If you want to take advantage of LIMIT
to improve performance, you need
- understand the data you are retrieving
- proper indexing the correct sequence of columns
- take responsibility for refactoring the query
- using
LIMIT
before JOIN
These principles can go a long way if you can orchestrate them.
I learned these concepts by watching this YouTube Video (listen carefully through the French accent)
I used those concepts to answer a very tough StackOverflow question about getting the top 40 articles from some tables : May 12, 2011 : Fetching a Single Row from Join Table.
In my answer to that question (May 16, 2011), I wrote the following query and tested it thoroughly:
SELECT
AAA.author_id,
AAA.date_created,
IFNULL(BBB.title,'<NO_TITLE>') title,
IFNULL(CCC.filename,'<NO-IMAGE>') filename,
IFNULL(CCC.date_added,'<NO-IMAGE-DATE>') image_date
FROM
(
SELECT
AA.id,
AA.date_added,
BB.author_id,
BB.date_created
FROM
(
SELECT
A.id,IFNULL(MAX(B.date_added),'1900-01-01 00:00:00') date_added
FROM (SELECT id FROM articles ORDER BY date_created DESC LIMIT 40) A
LEFT JOIN article_images B ON A.id = B.article_id
GROUP BY A.id
) AA
INNER JOIN articles BB USING (id)
) AAA
LEFT JOIN article_contents BBB ON AAA.id=BBB.article_id
LEFT JOIN article_images CCC
ON (AAA.id=CCC.article_id AND AAA.date_added=CCC.date_added)
ORDER BY AAA.date_created DESC;
Please notice the line in the query with the LIMIT
FROM (SELECT id FROM articles ORDER BY date_created DESC LIMIT 40) A
This subquery is buried three levels deep. This allowed me to get the last 40 articles using LIMIT
. Then, I performed the necessary JOINs afterwards.
LESSONS LEARNED
- Doing
LIMIT
inside subqueries may not always be the answer because of the cardinality of indexes, the data content, and the result set size from the LIMIT
. If you have all your "ducks in a row" (Have the four principles in mind for your query), you can get surprisingly good results.
- Make your queries as simplistic as possible when doing
LIMIT
by gathering keys only.
Best Answer
SQL is a declarative language; that is, it does not specify how something should be done, only what the result should be.
The computing languages I have used all respect the properties of equality, so
employees.ID = unionMembers.ID
is equivalent tounionMembers.ID = employees.ID
In this case, rather than an outer join, which can generate a large data-set, you would probably want to use a sub-query or the EXISTS operator, depending on your data structure. Something like:
SELECT ID, wage, unionDues FROM employees WHERE ID in ( SELECT ID FROM unionMembers WHERE ID = 'ID001' );
There is an in-depth discussion here: https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization-with-exists.html
HTH