I'm trying to get all users that haven't been sent a message in a given day. I have two tables, one with the messages that have a reporterid indexed column, and a reporter table.
Let me walk you through it:
This inner query returns reporters that have received messages in a given day. It takes tenths of a second on my local dev machine:
select m.reporterid
from nubamessage m
where m.messagetypeenum =7
and m.createdOn>='2016-06-18 00:00:00'
and m.createdOn<='2016-06-18 09:30:00';
There is no difference in the execution time of the above query if I do select(distinct) or not, and it returns the same rows.
But if I used that in an outer join query, it basically never completes:
select *
from reporter r
left outer join
(
SELECT m.reporterid
from nubamessage m
where m.messagetypeenum =7
and m.createdOn>='2016-06-18 00:00:00'
and m.createdOn<='2016-06-18 09:30:00'
) as mm ON mm.reporterid=r.id
where r.enabled=1 and m.reporterid is null;
However, if I change the inner query to distinct as the below, it takes a fraction of a second:
select *
from reporter r
left outer join
(
SELECT distinct(m.reporterid) -- the only difference
from nubamessage m
where m.messagetypeenum =7
and m.createdOn>='2016-06-18 00:00:00'
and m.createdOn<='2016-06-18 09:30:00'
) as mm ON mm.reporterid=r.id
where r.enabled=1 and m.reporterid is null;
The execution plan for the non-completing query:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE r NULL ALL NULL NULL NULL NULL 11538 50.00 "Using where"
1 SIMPLE m NULL ALL NULL NULL NULL NULL 968388 100.00 "Using where; Using join buffer (Block Nested Loop)"
The execution plan for the query where I just add "distinct", it adds some auto-key:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY r NULL ALL NULL NULL NULL NULL 11538 50.00 "Using where"
1 PRIMARY <derived2> NULL ref <auto_key0> <auto_key0> 5 r.id 10 100.00 "Using index"
2 DERIVED m NULL ALL NULL NULL NULL NULL 968388 1.11 "Using where; Using temporary"
If someone could find it in their heart to explain to me why 1. the first query never completes/is so slow and 2. why the distinct makes it run fast, even though the inner query returns the exact same rows in both cases, I'd be most grateful.
Thanks
Edit: Notice that I write "all users that haven't been sent a message. This is why I have the left join – later in the query I have "where m.reporterid is null", in order to only get out the reporters that don't already have a message in the messagetable. I have edited the questions above to reflect that.
I could have used a "not in" instead but from what I've read and in my tests it was a bit slower than left join.
Best Answer
LEFT
is killing performance; remove it unless you have a good reason for keeping it. To elaborate...((but first... The purpose for the
LEFT
was missing from the original query; this answer assumes theLEFT
was not necessary.))LEFT JOIN
says that you want data from the right table, whether or not there was a match with the left table.JOIN
says to display only the rows that match (viaON
) both tables.(
OUTER
is optional and adds no semantics.)If you are running a version before 5.6, the derived table (subquery in
LEFT JOIN
) will have no index, so it must be scanned repeatedly. This is a big reason to get rid ofLEFT
.Without the
LEFT
, the Optimizer is likely to evaluate the subquery once, then efficientlyJOIN
toreporter
to finish the query.For
JOIN
, the inner query needs this composite (and covering) indexINDEX(messagetypeenum, createdOn, reporterid)
.There is another technique (I think)... Get rid of the inner
SELECT
, simplyJOIN
(orLEFT JOIN
) to the table:In this case, it may needs
INDEX(reporterid, messagetypeenum, createdOn)
.Yet another variant would use
EXISTS
and, I think, provides the equivalent of theLEFT
I can't predict which variant will be fastest. It partially depends on how "many" in the many:1 mapping of
m.reporterid
:r.id
.