Mysql – why should I join 2 tables with 1 table apart

join;MySQL

I could not solve this problem on hackerrank , and had to look up the solution. I was joining the wrong tables.

There are 4 tables:

  • hackers
  • challenges
  • submissions made by hackers and their scores
  • difficulty table with levels.

the task description:

Write a query to print the respective hacker_id and name of hackers
who achieved full scores for more than one challenge.

Submissions table fields: submission_id, hacker_id, challenge_id, score

challenge table fields: challenge_id, hacker_id, difficulty_level

The way i joined: hackers + challenges, challenges + difficulty , challenges + submissions.

select  
hackers.hacker_id, name
 
from submissions inner join challenges on submissions.challenge_id = challenges.challenge_id 
inner join difficulty on difficulty.difficulty_level = challenges.difficulty_level
inner join hackers on 

challenges.hacker_id = hackers.hacker_id -- here is wrong part!

where difficulty.score = submissions.score and difficulty.difficulty_level = challenges.difficulty_level
group by hackers.hacker_id, name 
having count( challenges.challenge_id)> 1
order by count( challenges.challenge_id) desc, hackers.hacker_id 

However, the right way was almost same – except that i should have joined
submissions + hackers by hacker_id
vs
not submissions + challenges by challenge_id.

The correct way:

select  
hackers.hacker_id,name

from submissions inner join challenges on submissions.challenge_id = challenges.challenge_id 
inner join difficulty on difficulty.difficulty_level = challenges.difficulty_level
inner join hackers on submissions.hacker_id = hackers.hacker_id
where difficulty.score = submissions.score and difficulty.difficulty_level = challenges.difficulty_level
group by hackers.hacker_id, name 
having count( challenges.challenge_id)> 1
order by count( challenges.challenge_id) desc, hackers.hacker_id 

What's the logic behind joining submissions + challenges by challenge_id vs
submissions + hackers by hacker_id?

why does it produce different result? A hacker makes submissions , so it should not matter whether i join challenge+submission + hackers or
submission + hacker + challenge…

Best Answer

It's a bit of a guess because you haven't provided the link to the challenge (in markdown you have to make links like [this problem](http://whatever.com) but clicking your link just takes to http://this%20problem, so I guess the markdown is broken

You've said you have tables:

Hacker - I guess this is like a user table
Challenge - I guess this is a problem to solve
Submission - I guess it's an answer a hacker has made to a problem

So we can clearly have "one hacker may answer many challenges" and "one challenge has many hackers answering it" ie a many many relationship so we need a table in the middle to split it down and track which hacker answered which challenge, and this is the Submissions table doing this breakdown

This interesting part here, and possibly put in deliberately to trip people up, is that Challenge also has a hacker Id column. For some reason hacker rank's tables track a hacker-challenge relationship that is nothing to do with a submission ie the hacker ID in the challenge table is something where there can be only one hacker for that challenge, such as the hacker that created it, or the hacker that first got perfect score on it, or the hacker who is responsible for marking the answers on it, or the hacker that most recently took it..

Whatever the reason, you said you joined on that hacker ID which in my mind is the wrong thing to do because it bears no relation to the question of multiple hackers who have submitted: it could be that the hacker in the challenge.hacker_id has never even made a submission on the challenge, they simply wrote it and have never answered it. It therefore cannot give you the answers the challenge seeks. It's probably meant to catch you out and it's one reason why when I'm designing a database schema I aim to put the reason for a column into its name in cases like this; ie in the Challenge table I might have called it author_hackerid or moderator_hackerid to describe not only where it links to (hacker table on hacker ID) but why it links there (author, moderator, firstwinner, recenttaker) especially if there are multiple people.. (though in such a case it may make more sense to have another table, like submission) that splits hacker and challenge, and gives a reason for why that hacker is related to that challenge)

Remember, that just because two columns are named the same it doesn't mean they are alternative join options. Even columns that have participated in a join before are not alternative, if it's a left join. If it's an inner join, fair game:

a INNER JOIN b ON a.id = b.id INNER JOIN c ON b.id = c.id
a INNER JOIN b ON a.id = b.id INNER JOIN c ON a.id = c.id
                                              ^

but these can give different results:

a LEFT JOIN b ON a.id = b.id LEFT JOIN c ON b.id = c.id
a LEFT JOIN b ON a.id = b.id LEFT JOIN c ON a.id = c.id

That small but critical difference in the second query of both tables left joining to a might be huge