MySQL – Can filtering be done more efficiently with/without SQL joins

index, mariadb, MySQL, nosql

I have a simple problem with a simple SQL solution, but I would like to explore alternatives in case they turn out to be more efficient at large scale.

Let's assume that we have a system with users, videos, and a list of the videos each user has viewed:

CREATE TABLE `video` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `rank` INT NULL,
  `created_at` DATETIME NULL,
  PRIMARY KEY (`id`),
  INDEX `idx_rank` (`rank` ASC));

CREATE TABLE `user` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `created_at` DATETIME NULL,
  PRIMARY KEY (`id`));

CREATE TABLE `user_view` (
  `user_id` BIGINT NOT NULL,
  `video_id` BIGINT NOT NULL,
  PRIMARY KEY (`user_id`, `video_id`));

CREATE TABLE `user_friend` (
  `user_id` BIGINT NOT NULL,
  `friend_id` BIGINT NOT NULL,
  PRIMARY KEY (`user_id`, `friend_id`));

The question is: How would we find all the videos that friends of a user have not viewed?

In my solution, I would get the ids of the user's friends (let's assume the user has 3 friends, with ids 1, 2, 3) and build a query:

SELECT
    v.id,
    uv1.video_id
FROM
    video v
    LEFT JOIN user_view uv1 ON (v.id = uv1.video_id AND uv1.user_id = 1)
    LEFT JOIN user_view uv2 ON (v.id = uv2.video_id AND uv2.user_id = 2)
    LEFT JOIN user_view uv3 ON (v.id = uv3.video_id AND uv3.user_id = 3)
WHERE 1
    AND v.id > 100
    AND uv1.video_id IS NULL
    AND uv2.video_id IS NULL
    AND uv3.video_id IS NULL
ORDER BY
    v.rank
LIMIT
    30

The above query will work, but the more friends a user has, the more joins we'd have to add to the query.

Let's assume we're dealing with 1 billion videos, 100 million users, and on average 50 friends per user.

Is there a more efficient way to do this in SQL?

Is there a way to do this outside traditional SQL? Perhaps with a NoSQL store such as MongoDB, Cassandra, Riak, Redis, CouchDB or anything else? I'm wondering if there is something purpose-built for this that is more efficient.

Is there any other programming/processing technique that would prove to be more efficient?

I would really appreciate your input.

Best Answer

I don't understand your two choices, but I feel sure that a LEFT JOIN is involved.

Ponder something like this. Step 1: find the videos that none of your friends has viewed:

SELECT v.id AS video_id
    FROM video AS v
    LEFT JOIN (
        SELECT uv.video_id     -- everything your friends have viewed
            FROM user_friend AS uf
            JOIN user_view AS uv  ON uv.user_id = uf.friend_id
            WHERE uf.user_id = 123   -- you
         ) AS seen  ON seen.video_id = v.id
    WHERE seen.video_id IS NULL    -- that they did not view

But, beware, the output could be very long.

I don't know if you need DISTINCT after SELECT.

Step 2: Now to peel off the 30 highest ranked ones:

SELECT v.id
    FROM ( the above query ) AS b
    JOIN video AS v  ON b.video_id = v.id
    ORDER BY v.rank DESC
    LIMIT 30

Unfortunately, the query must find all the un-viewed videos before it can sort to locate the top 30.
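
A single anti-join can express the same filter without adding one join per friend. The sketch below is only an illustration, not the answer author's query: it uses Python's sqlite3 module as a stand-in for MySQL (an assumption, so query plans and performance will differ), follows the question's table names, and runs on invented sample rows. The helper name `unviewed_by_friends` is hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL/MariaDB here (assumption). Table and column
# names follow the question's schema; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE video (id INTEGER PRIMARY KEY, `rank` INTEGER);
    CREATE TABLE user_view (user_id INTEGER, video_id INTEGER,
                            PRIMARY KEY (user_id, video_id));
    CREATE TABLE user_friend (user_id INTEGER, friend_id INTEGER,
                              PRIMARY KEY (user_id, friend_id));
    CREATE INDEX idx_rank ON video (`rank`);

    INSERT INTO video (id, `rank`) VALUES (101, 5), (102, 9), (103, 7), (104, 1);
    INSERT INTO user_friend VALUES (123, 1), (123, 2), (123, 3);
    INSERT INTO user_view VALUES (1, 102), (2, 104), (9, 103);
""")

# One statement for any number of friends: the friend list is read from
# user_friend inside the subquery instead of being spelled out as joins.
QUERY = """
    SELECT v.id
    FROM video AS v
    WHERE v.id > 100
      AND NOT EXISTS (
            SELECT 1
            FROM user_friend AS uf
            JOIN user_view AS uv ON uv.user_id = uf.friend_id
            WHERE uf.user_id = :me
              AND uv.video_id = v.id)
    ORDER BY v.`rank` DESC
    LIMIT 30
"""


def unviewed_by_friends(conn, me):
    """Return ids of videos that no friend of `me` has viewed, best rank first."""
    return [row[0] for row in conn.execute(QUERY, {"me": me})]


print(unviewed_by_friends(conn, 123))  # [103, 101]
```

Video 103 stays in the result because its only viewer (user 9) is not a friend of user 123. Whether MySQL can walk idx_rank and stop after 30 rows with this form depends on its optimizer and is not demonstrated here.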

Related Solutions

MySQL – Unexplained InnoDB timeouts

I know this is really late, but you really need to capture the output of SHOW ENGINE INNODB STATUS; during that query to see why it's waiting.

If it happens a lot during a specific time, it would be easy to just grab that output every x seconds and hope you capture it (or perhaps artificially generate the load).
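
The periodic capture described above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and `fetch_innodb_status` assumes the `mysql` command-line client is on PATH and can log in without prompting (for example via an option file).

```python
import subprocess
import time
from datetime import datetime


def fetch_innodb_status():
    """Return the text of SHOW ENGINE INNODB STATUS via the mysql CLI.

    Assumption: `mysql` is on PATH and credentials come from an option file.
    """
    result = subprocess.run(
        ["mysql", "-e", "SHOW ENGINE INNODB STATUS\\G"],
        capture_output=True, text=True, check=True)
    return result.stdout


def capture(fetch, interval_seconds, samples, sleep=time.sleep):
    """Call fetch() `samples` times, pausing between calls.

    Returns a list of (timestamp, text) pairs so the snapshot taken
    closest to the stalled query can be picked out afterwards.
    """
    snapshots = []
    for i in range(samples):
        stamp = datetime.now().isoformat(timespec="seconds")
        snapshots.append((stamp, fetch()))
        if i < samples - 1:
            sleep(interval_seconds)
    return snapshots


# Example use (not run here, since it needs a live server):
#   for stamp, text in capture(fetch_innodb_status, interval_seconds=5, samples=12):
#       print(stamp, len(text))
```

Passing `fetch` and `sleep` in as parameters keeps the loop easy to try out with a stub before pointing it at a real server.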

MySQL looking up more rows than needed (indexing issue)

Your indexes are fine for the two types of queries you mentioned.

This query will be satisfied by traversing the clustered index on the primary key...

[...] WHERE participant_id = x AND question_id = y AND given_answer_id = z;

...and this one is satisfied by the index on 'question_id':

[...] WHERE question_id = x;

The output of EXPLAIN SELECT is not telling you what you think it is telling you, because the value shown in rows is an estimate of the number of rows the server will need to consider, not the actual rows it will examine. For InnoDB these are based on index statistics.

rows

The rows column indicates the number of rows MySQL believes it must examine to execute the query.

For InnoDB tables, this number is an estimate, and may not always be exact.

Source: http://dev.mysql.com/doc/refman/5.5/en/explain-output.html#explain_rows

The optimizer gathers information about different possible query plans, and chooses the one with the lowest cost. The information shown in EXPLAIN is the information the optimizer gathered about the plan it selected.

When type is ref and key is not NULL, this means that the name listed in the key column is the name of the index that the optimizer has chosen to use to find the desired rows, so your query plan looks exactly as it should.

Note that you will sometimes see Using index in the Extra column. Many people assume that this means an index is being used, or that no index is being used when it doesn't appear, but neither is correct. Using index describes a special case called a "covering index"; it does not indicate whether an index is being used to locate the rows of interest.

It's possible that running ANALYZE [LOCAL] TABLE would cause the numbers in rows shown by EXPLAIN to differ, but this is a simple query and selecting this index is an obvious choice for the optimizer to make, so ANALYZE TABLE is unlikely to make any actual difference in performance.

It is possible, however, that your overall performance might see some marginal improvement with an occasional OPTIMIZE [LOCAL] TABLE, because you are not inserting rows in primary key order (as would be the case with an auto_increment primary key). On large tables this can be time-consuming because it builds a new copy of the table, and, again, I wouldn't expect any significant change.
