Mysql – how to optimize thesql query/ structure/ approach for large datasets

MySQLoptimization

Mysql Database structure:

I have two tables:

Users Table with autoincrementing id contains about 20K rows and growing:

id userName

1 John
2 Doe
3 Alice

Contents Table with autoincrementing id contains about 10K rows and growing:

id Content

1 Content1
2 Content2
3 Content3

UserContent Table for association with both userId and contentId as foreign Keys and autoincrementing id:

id userId contentId
1  1      1
2  1      2
3  1      3
4  2      1
4  2      2 
...

Problem Statement:

The problem is that i have a process which runs everyday and runs a query which picks up the next content item for every user that he or she has not received previously, and adds that unique combination of userId and contentId in the userContents table.

The Contents should be added to each user's account in the same order.

For example, the process runs on day 1 and adds 20k rows (one row for each user) in the table. on Day 2, another 20k rows are added and so on. As you can imagine, this has lead to alot of data in the userContent table which is slowing down the query. Currently, there are about 4.4 Million records on the userContent table and it is growing day by day.

users cannot modify/clear the content that they have received. the records in userContent table will remain the same always.

Current Sql Query:

the current query that i am using is this:

select u.id as userId, content.id as contentId
 from users u, contents c 
where 
u.id not in 
(select distinct uc.userId from userContent uc where uc.userId = u.id and uc.contentId = c.id)
and c.id not in 
(select distinct uc.contentId from userContent uc where uc.userId = u.id and uc.contentId = c.id)

What this query does

this query will pick out a unique content for each and every user in the database which that user has previously not received.

Lets dry run this:

Day 1: there are three users in the system: User A, User B and User C. Each of these users will get content A.

Day 2: Each of the three users A, B and C will receive Content B.

Day 3: A new user D signs up. the first three users A, B and C will receive content C whereas User D will receive content A.

Major problems that i have identified in this query:

  1. since there is nothing common between users and contents, i have to make a cross join which results in a total dataset of 200,000,000 rows! (10k content rows and 20k users rows).

  2. For each of these 200,000,000 rows, the query checks whether the combination of userId and contentId exists in the userContent table or not by making two separate sub queries.

Help needed in:

Currently, this query takes well over a minute to execute. Is there any way i can speed up this query by changing the query or the database structure or even the approach itself?

any help in this regard will be highly appreciated! Also, let me know if further clarifications are needed.

thanks!

Best Answer

I believe that I have the (better) solution to your problem.

(tl; dr) I've created a query which CROSS JOINs newly users with newly inserted content (few records presumably), thereby avoiding an expensive CROSS JOIN between the full user and content table (many records, i.e. expensive!).

I created tables user, content and user_content in the dbfiddle here. See DML and DDL at the end of this answer. I changed some of the names of the tables &c. to reflect my own preferences

  • singular table names (tables are collections anyway)

  • all lower case table names with underscores)

  • a composite PRIMARY KEY for user_content (see here for my thoughts on composite PRIMARY KEYs).

I then ran this query:

SELECT u.user_id AS "User ID", x.content_id AS "Content ID"
FROM user u
CROSS JOIN 
(
  SELECT c.content_id FROM content c WHERE c.content_id NOT IN (SELECT content_id FROM user_content)
) AS x
UNION
SELECT y.user_id, c.content_id
FROM content c
CROSS JOIN 
(
  SELECT u.user_id FROM user u WHERE u.user_id NOT IN (SELECT user_id FROM user_content)
) AS y
ORDER BY "User ID", "Content ID";

Which gives the same results as your one!

What my query is doing is CROSS JOINing the results of a search for new content (in the first subquery) with a search for new users (second subquery) providing, AIUI, this is the desired result, but without a CROSS JOIN over the full tables. There are differences in the result of EXPLAIN EXTENDED.

my query:

mysql> explain extended SELECT u.user_id AS "User ID", x.content_id AS "Content ID"
    -> FROM user u
    -> CROSS JOIN 
    -> (
    ->   SELECT c.content_id FROM content c WHERE c.content_id NOT IN (SELECT content_id FROM user_content)
    -> ) AS x
    -> UNION
    -> SELECT y.user_id, c.content_id
    -> FROM content c
    -> CROSS JOIN 
    -> (
    ->   SELECT u.user_id FROM user u WHERE u.user_id NOT IN (SELECT user_id FROM user_content)
    -> ) AS y
    -> ORDER BY "User ID", "Content ID";
+----+--------------------+--------------+----------------+--------------------+--------------------+---------+------+------+----------+--------------------------+
| id | select_type        | table        | type           | possible_keys      | key                | key_len | ref  | rows | filtered | Extra                    |
+----+--------------------+--------------+----------------+--------------------+--------------------+---------+------+------+----------+--------------------------+
|  1 | PRIMARY            | <derived2>   | system         | NULL               | NULL               | NULL    | NULL |    1 |   100.00 |                          |
|  1 | PRIMARY            | u            | index          | NULL               | PRIMARY            | 4       | NULL |    4 |   100.00 | Using index              |
|  2 | DERIVED            | c            | index          | NULL               | PRIMARY            | 4       | NULL |    4 |   100.00 | Using where; Using index |
|  3 | DEPENDENT SUBQUERY | user_content | index_subquery | uc_content_user_uq | uc_content_user_uq | 4       | func |    4 |   100.00 | Using index              |
|  4 | UNION              | <derived5>   | system         | NULL               | NULL               | NULL    | NULL |    1 |   100.00 |                          |
|  4 | UNION              | c            | index          | NULL               | PRIMARY            | 4       | NULL |    4 |   100.00 | Using index              |
|  5 | DERIVED            | u            | index          | NULL               | PRIMARY            | 4       | NULL |    4 |   100.00 | Using where; Using index |
|  6 | DEPENDENT SUBQUERY | user_content | index_subquery | PRIMARY            | PRIMARY            | 4       | func |    4 |   100.00 | Using index              |
| NULL | UNION RESULT       | <union1,4>   | ALL            | NULL               | NULL               | NULL    | NULL | NULL |     NULL | Using filesort           |
+----+--------------------+--------------+----------------+--------------------+--------------------+---------+------+------+----------+--------------------------+
9 rows in set, 1 warning (0.01 sec)

then show the 1 warning:

mysql> show warnings;
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                                 |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | select `ptest`.`u`.`user_id` AS `User ID`,'4' AS `Content ID` from `ptest`.`user` `u` union select '4' AS `user_id`,`ptest`.`c`.`content_id` AS `content_id` from `ptest`.`content` `c` order by 'User ID','Content ID' |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Running the query in the warning

select 
  `ptest`.`u`.`user_id` AS `User ID`,
  '4' AS `Content ID` 
from `ptest`.`user` `u` 
union 
select 
  '4' AS `user_id`,
  `ptest`.`c`.`content_id` AS `content_id` 
from `ptest`.`content` `c` 
order by 'User ID','Content ID'

gives the correct result. I think that this query is MySQL's optimisation of my query - obviously, the '4' is taken from values in the tables and not hard-coded by me.

Running your orignal query gives 5 warnings:

mysql> explain extended SELECT 
    ->   u.user_id as "User ID", 
    ->   c.content_id AS "Content ID"
    -> FROM 
    ->   user u, content c 
    -> WHERE
    ->   u.user_id NOT IN
    ->                (
    ->                  SELECT DISTINCT uc.user_id 
    ->                  FROM user_content uc 
    ->                  WHERE uc.user_id = u.user_id AND uc.content_id = c.content_id
    ->                )
    -> 
    -> AND 
    ->   c.content_id NOT IN 
    ->               (
    ->                 SELECT DISTINCT uc.content_id 
    ->                 FROM user_content uc 
    ->                 WHERE uc.user_id = u.user_id 
    ->                 AND uc.content_id = c.content_id
    -> )
    -> ORDER BY user_id, content_id;
+----+--------------------+-------+--------+----------------------------+---------+---------+------------------------------------+------+----------+----------------------------------------------+
| id | select_type        | table | type   | possible_keys              | key     | key_len | ref                                | rows | filtered | Extra                                        |
+----+--------------------+-------+--------+----------------------------+---------+---------+------------------------------------+------+----------+----------------------------------------------+
|  1 | PRIMARY            | u     | index  | NULL                       | PRIMARY | 4       | NULL                               |    4 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | PRIMARY            | c     | index  | NULL                       | PRIMARY | 4       | NULL                               |    4 |   100.00 | Using where; Using index; Using join buffer  |
|  3 | DEPENDENT SUBQUERY | uc    | eq_ref | PRIMARY,uc_content_user_uq | PRIMARY | 8       | ptest.u.user_id,ptest.c.content_id |    1 |   100.00 | Using where; Using index; Using temporary    |
|  2 | DEPENDENT SUBQUERY | uc    | eq_ref | PRIMARY,uc_content_user_uq | PRIMARY | 8       | ptest.u.user_id,ptest.c.content_id |    1 |   100.00 | Using where; Using index; Using temporary    |
+----+--------------------+-------+--------+----------------------------+---------+---------+------------------------------------+------+----------+----------------------------------------------+
4 rows in set, 5 warnings (0.00 sec)

mysql>

4 of the warnings are about resolving fields or references, but the final one is a select (which is quite complex!)

select 
  `ptest`.`u`.`user_id` AS `User ID`,
  `ptest`.`c`.`content_id` AS `Content ID` 
from `ptest`.`user` `u` 
join 
  `ptest`.`content` `c` 
   where 
   (
     (
       not(<in_optimizer>
             (`ptest`.`u`.`user_id`,
              <exists>(select distinct 1 from `ptest`.`user_content` `uc` 
                 where ((`ptest`.`uc`.`user_id` = `ptest`.`u`.`user_id`) 
                 and (`ptest`.`uc`.`content_id` = `ptest`.`c`.`content_id`) 
                 and (<cache>(`ptest`.`u`.`user_id`) = `ptest`.`uc`.`user_id`)))))) 
  and (not(<in_optimizer>(`ptest`.`c`.`content_id`,
              <exists>(select distinct 1 from `ptest`.`user_content` `uc` 
                 where ((`ptest`.`uc`.`user_id` = `ptest`.`u`.`user_id`) 
                 and (`ptest`.`uc`.`content_id` = `ptest`.`c`.`content_id`) 
                 and (<cache>(`ptest`.`c`.`content_id`) = `ptest`.`uc`.`content_id`))))))) 
order by `ptest`.`u`.`user_id`,`ptest`.`c`.`content_id`

As you can see, your EXPLAIN EXTENDED query is a good deal more complex than mine. My query provides a CROSS JOIN of new users and content (presumably smaller than the existing table), and I believe will be more performant on large datasets - which I don't have. You'll need to test!

------------------- DDL and DML --------------------------

CREATE TABLE user
(
  user_id INTEGER(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_name VARCHAR (50) NOT NULL
);

CREATE TABLE content
(
  content_id INTEGER(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  content_text VARCHAR (1000) NOT NULL  -- can be a BLOB, whatever
);

INSERT INTO user (user_name) VALUES
('Bob'),
('Alice'),
('Eve');

INSERT INTO content (content_text)  VALUES
('Content1'),
('Content2'),
('Content3');

CREATE TABLE user_content 
(
  user_id INTEGER NOT NULL,
  content_id INTEGER NOT NULL,
  CONSTRAINT user_content_pk PRIMARY KEY (user_id, content_id),
  CONSTRAINT uc_user_fk FOREIGN KEY (user_id) REFERENCES user (user_id),
  CONSTRAINT uc_content_fk FOREIGN KEY (content_id) REFERENCES content (content_id),
  CONSTRAINT uc_content_user_uq UNIQUE INDEX (content_id, user_id) 

-- will be unique anyway because of PK, but you can never give the optimiser
-- too much information. This index will speed up content by user searches.
 );   

INSERT INTO user_content 
VALUES
(1, 1),
(1, 2),
(1, 3),
(2, 1),
(2, 2),
(2, 3),
(3, 1),
(3, 2),
(3, 3);

-- now insert a user with no content and content with no user

INSERT INTO user (user_name) VALUES ('Mary');
INSERT INTO content (content_text) VALUES ('Content4');