I have a simple problem that has a simple solution with SQL, but would like to explore alternative ways to solve it if they turn out to be more efficient on large scale.
Let's assume that we have a system where we have users, videos and list of videos users have viewed:
CREATE TABLE `video` (
`id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`rank` INT NULL,
`created_at` DATETIME NULL,
PRIMARY KEY (`id`),
INDEX `idx_rank` (`rank` ASC));
CREATE TABLE `user` (
`id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`created_at` DATETIME NULL,
PRIMARY KEY (`id`));
CREATE TABLE `user_view` (
`user_id` BIGINT NOT NULL,
`video_id` BIGINT NOT NULL,
PRIMARY KEY (`user_id`, `video_id`));
CREATE TABLE `user_friend` (
`user_id` BIGINT NOT NULL,
`friend_id` BIGINT NOT NULL,
PRIMARY KEY (`user_id`, `friend_id`));
The question is: How would we find all the videos that friends of a user have not viewed?
In my solution, I would get the ids of friends (Let's assume user has 3 friends, with ids 1, 2, 3) and build a query:
SELECT
v.id,
uv1.video_id
FROM
video v
LEFT JOIN user_view uv1 ON (v.id = uv1.video_id AND uv1.user_id = 1)
LEFT JOIN user_view uv2 ON (v.id = uv2.video_id AND uv2.user_id = 2)
LEFT JOIN user_view uv3 ON (v.id = uv3.video_id AND uv3.user_id = 3)
WHERE 1
AND v.id > 100
AND uv1.video_id IS NULL
AND uv2.video_id IS NULL
AND uv3.video_id IS NULL
ORDER BY
v.rank
LIMIT
30
Above query will work, but more friends a user has, more joins we'd have to add to the query.
Let's assume we're dealing with 1 billion videos, 100 million users and on averahe user having 50 friends.
Is there a more efficient way to do this in SQL?
Is there a way to do this with non traditional SQL way? Perhaps with noSql with mongodb, cassandra, riak, redis, couchdb or anything else? I'm wondering if there is anything else more efficient purpose built for this.
Any other programming/processing technique that would prove to be more efficient?
I would really appreciate your input.
Best Answer
I don't understand your two choices, but I feel sure that a
LEFT JOIN
is involved.Ponder something like this: Step 1:
But, beware, the output could be very long.
I don't know if you need
DISTINCT
afterSELECT
.Step 2: Now to peel off the 30 highest ranked ones:
Unfortunately, the query must find all the un-viewed videos before it can sort to locate the top 30.