Mysql – Find the user with closest taste of mine

MySQLoptimizationquery-performance

I trying to write in SQL the following requirement:

Each user can rate as much films as they like. (1= I don't like, 5=I love it)
And the system should provide to the current user a list of films that he doesn't rate yet and that other users like too.

To simplify, I have this table and the following data:

create table wich (
  uid int,
  film varchar(50),
  rate int
);

insert into wich values 
(1, 'usual suspect', 5), (1, 'gataca', 4), (1, 'goldeneye', 2),
(2, 'usual suspect', 4), (2, 'gataca', 5), (2, 'i am a legend', 4), (2, 'the hobbit', 1),
(3, 'usual suspect', 1), (3, 'gataca', 5), (3, 'goldeneye', 5),
(4, 'usual suspect', 5), (4, 'goldeneye', 5),
(5, 'usual suspect', 5), (5, 'gataca', 4), (5, 'goldeneye', 5),
(6, 'usual suspect', 4), (6, 'gataca', 3), (6, 'goldeneye', 3), (6, 'shrek', 4);

This can be read as: User 1 like 'usual suspect' and 'gataca' but not 'goldeneye'.

So what I want know is to find the user(s) that have the closest tastes of the current user and get a list of films that the current user could probably like.

Following, what I have done so far:

Step 1: for every films rated by user 1, if user1 and user2 rate the film in the same way, compute a score (should be 0 if the difference is greater than 2)
and it's great if both like a film or both doesn't like a film.

select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1

Step 2: compute a score by user

select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
from (
    select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
    from wich w1
    inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
    where w1.uid=1
) scores
group by uid
ORDER BY score desc

step 3: find films that user1 has not yet rated that he could like

select w3.uid, w3.film,w3.rate, matches.score
from (
    select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
    from (
        select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1
    ) scores
    group by uid
    ORDER BY score desc
) as matches
inner join wich w3 on matches.uid=w3.uid and w3.rate >= 3
left join wich w4 on w4.uid=1 and w3.film=w4.film
where w4.uid is null
order by w3.rate desc, matches.score desc;

This seems to work, but I am not sure that this will still respond in a small amount of time if each user rate of lot of films and the table become become bigger.

What do you think ?

Is there a better way to accomplish such a thing ?

A working fiddle: http://sqlfiddle.com/#!2/89c57/1/0


Edit:

In my real table, the film column is an integer and I have a second table with films data (title, description, year, …)

create table wich (
  uid int,
  film int,
  rate int
);

The example given here is just to simplify the problem with just one table.

Best Answer

You can make a table which holds the users with similar taste for every user.

CREATE TABLE `similar_users` (
    `uid1` INT,
    `uid2` INT,
    `score` INT,
    PRIMARY KEY (`uid1`, `uid2`)
)

The algorithm for calculating the score can be anything, as long as a higher score means a closer taste.

You you should limit the amount of similar users to a fixed number and also exclude the ones with a score less than a certain value (the ones with a not so similar tastes).

The contents of the table will be calculated periodically (e.g. every day).

Then step 3 from your solution becomes something like this:

SELECT w.film
FROM similar_users su
JOIN wich w ON w.uid = su.uid2 AND w.rate > 2
LEFT JOIN wich wn ON wn.film = w.film AND wn.uid = {current_user}
WHERE su.uid1 = {current_user}
ORDER BY su.score * w.rate DESC