MySQL database structure:
I have two tables:
Users table with an auto-incrementing id; it contains about 20K rows and is growing:
id userName
1 John
2 Doe
3 Alice
Contents table with an auto-incrementing id; it contains about 10K rows and is growing:
id Content
1 Content1
2 Content2
3 Content3
UserContent table for the association, with userId and contentId as foreign keys and an auto-incrementing id:
id userId contentId
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
...
Problem Statement:
The problem is that I have a process that runs every day. It runs a query that picks, for every user, the next content item that he or she has not received previously, and adds that unique combination of userId and contentId to the userContent table. The contents should be added to each user's account in the same order.
For example, the process runs on day 1 and adds 20K rows (one row for each user) to the table. On day 2, another 20K rows are added, and so on. As you can imagine, this has led to a lot of data in the userContent table, which is slowing down the query. Currently, there are about 4.4 million records in the userContent table, and it is growing day by day.
Users cannot modify or clear the content that they have received. The records in the userContent table will always remain the same.
Current Sql Query:
The current query that I am using is this:
select u.id as userId, c.id as contentId
from users u, contents c
where
u.id not in
(select distinct uc.userId from userContent uc where uc.userId = u.id and uc.contentId = c.id)
and c.id not in
(select distinct uc.contentId from userContent uc where uc.userId = u.id and uc.contentId = c.id)
What this query does
This query will pick out, for each and every user in the database, content which that user has not previously received.
Let's dry run this:
Day 1: There are three users in the system: User A, User B and User C. Each of these users will get Content A.
Day 2: Each of the three users A, B and C will receive Content B.
Day 3: A new user D signs up. The first three users A, B and C will receive Content C, whereas User D will receive Content A.
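The dry run above can be reproduced on a tiny dataset. The following is only a sketch under stated assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL, "the same order" is taken to mean ascending content id, and the DAILY_INSERT statement is a hypothetical daily step that picks the lowest unreceived content id per user. It is not the query from the question.

```python
import sqlite3

# Hypothetical daily step (not the query from the question): for every user,
# take the lowest content id that has no matching userContent row yet.
DAILY_INSERT = """
    INSERT INTO userContent (userId, contentId)
    SELECT u.id, MIN(c.id)
    FROM users u
    CROSS JOIN contents c
    WHERE NOT EXISTS (
        SELECT 1
        FROM userContent uc
        WHERE uc.userId = u.id AND uc.contentId = c.id
    )
    GROUP BY u.id
"""


def simulate_three_days():
    """Replay the three-day dry run and return {userName: [content, ...]}."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users    (id INTEGER PRIMARY KEY AUTOINCREMENT, userName TEXT);
        CREATE TABLE contents (id INTEGER PRIMARY KEY AUTOINCREMENT, content  TEXT);
        CREATE TABLE userContent (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            userId    INTEGER NOT NULL REFERENCES users(id),
            contentId INTEGER NOT NULL REFERENCES contents(id)
        );
        INSERT INTO users (userName) VALUES ('User A'), ('User B'), ('User C');
        INSERT INTO contents (content)
            VALUES ('Content A'), ('Content B'), ('Content C');
    """)

    conn.execute(DAILY_INSERT)  # day 1
    conn.execute(DAILY_INSERT)  # day 2
    conn.execute("INSERT INTO users (userName) VALUES ('User D')")
    conn.execute(DAILY_INSERT)  # day 3

    rows = conn.execute("""
        SELECT u.userName, c.content
        FROM userContent uc
        JOIN users u    ON u.id = uc.userId
        JOIN contents c ON c.id = uc.contentId
        ORDER BY u.id, c.id
    """).fetchall()
    conn.close()

    received = {}
    for user_name, content in rows:
        received.setdefault(user_name, []).append(content)
    return received


if __name__ == "__main__":
    for user_name, items in simulate_three_days().items():
        print(user_name, items)
```

Note that this sketch still cross joins the full tables, so it illustrates the intended behaviour rather than a faster approach.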
Major problems that I have identified in this query:
- Since there is nothing in common between users and contents, I have to make a cross join, which results in a total dataset of 200,000,000 rows (10K content rows × 20K user rows)!
- For each of these 200,000,000 rows, the query checks whether the combination of userId and contentId exists in the userContent table by making two separate subqueries.
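Both subqueries in the query have the same WHERE clause, so each one only tests whether a matching (userId, contentId) row exists. Assuming neither column is ever NULL, a single NOT EXISTS check should return the same rows. The sketch below compares the two forms on the sample data from the question, using SQLite through Python's sqlite3 module as a stand-in for MySQL; the function names are made up for illustration.

```python
import sqlite3

# The question's query (with the c.id alias), two correlated subqueries per pair.
TWO_SUBQUERIES = """
    SELECT u.id AS userId, c.id AS contentId
    FROM users u, contents c
    WHERE u.id NOT IN (
              SELECT DISTINCT uc.userId FROM userContent uc
              WHERE uc.userId = u.id AND uc.contentId = c.id)
      AND c.id NOT IN (
              SELECT DISTINCT uc.contentId FROM userContent uc
              WHERE uc.userId = u.id AND uc.contentId = c.id)
    ORDER BY u.id, c.id
"""

# Assumed-equivalent form: one existence check per (user, content) pair.
ONE_SUBQUERY = """
    SELECT u.id AS userId, c.id AS contentId
    FROM users u
    CROSS JOIN contents c
    WHERE NOT EXISTS (
              SELECT 1 FROM userContent uc
              WHERE uc.userId = u.id AND uc.contentId = c.id)
    ORDER BY u.id, c.id
"""


def build_sample_db():
    """Create the three tables with the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users    (id INTEGER PRIMARY KEY, userName TEXT);
        CREATE TABLE contents (id INTEGER PRIMARY KEY, content  TEXT);
        CREATE TABLE userContent (
            id        INTEGER PRIMARY KEY,
            userId    INTEGER NOT NULL,
            contentId INTEGER NOT NULL
        );
        INSERT INTO users    VALUES (1, 'John'), (2, 'Doe'), (3, 'Alice');
        INSERT INTO contents VALUES (1, 'Content1'), (2, 'Content2'), (3, 'Content3');
        INSERT INTO userContent (userId, contentId)
            VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 2);
    """)
    return conn


def cross_join_size(conn):
    """Number of (user, content) pairs the cross join has to examine."""
    return conn.execute("SELECT COUNT(*) FROM users, contents").fetchone()[0]


def missing_pairs(conn, sql):
    """Run one of the two queries and return its (userId, contentId) rows."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    db = build_sample_db()
    print(cross_join_size(db))
    print(missing_pairs(db, TWO_SUBQUERIES))
    print(missing_pairs(db, ONE_SUBQUERY))
```

This only removes the duplicated check. The cross join itself still produces users × contents candidate pairs, which is the first problem listed above.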
Help needed in:
Currently, this query takes well over a minute to execute. Is there any way I can speed up this query by changing the query, the database structure, or even the approach itself?
Any help in this regard will be highly appreciated! Also, let me know if further clarifications are needed.
Thanks!
Best Answer
I believe that I have the (better) solution to your problem.
(tl;dr) I've created a query which CROSS JOINs new users with newly inserted content (few records, presumably), thereby avoiding an expensive CROSS JOIN between the full user and content tables (many records, i.e. expensive!).
I created tables user, content and user_content in the dbfiddle here. See DML and DDL at the end of this answer. I changed some of the names of the tables &c. to reflect my own preferences:
- singular table names (tables are collections anyway)
- all lower case table names with underscores
- a composite PRIMARY KEY for user_content (see here for my thoughts on composite PRIMARY KEYs).
I then ran this query:
Which gives the same results as yours!
What my query is doing is CROSS JOINing the results of a search for new content (in the first subquery) with a search for new users (second subquery), which, AIUI, is the desired result, but without a CROSS JOIN over the full tables. There are differences in the result of EXPLAIN EXTENDED.
My query:
then show the 1 warning:
Running the query in the warning
gives the correct result. I think that this query is MySQL's optimisation of my query - obviously, the '4' is taken from values in the tables and not hard-coded by me.
Running your original query gives 5 warnings:
4 of the warnings are about resolving fields or references, but the final one is a select (which is quite complex!)
As you can see, your EXPLAIN EXTENDED query is a good deal more complex than mine. My query provides a CROSS JOIN of new users and content (presumably smaller than the existing table), and I believe will be more performant on large datasets - which I don't have. You'll need to test!
------------------- DDL and DML --------------------------
-- now insert a user with no content and content with no user
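A minimal sketch of what a user_content table with a composite PRIMARY KEY might look like. The column names user_id, content_id, user_name and content are assumptions made for illustration, and SQLite (through Python's sqlite3 module) stands in for MySQL. With the pair itself as the key, a (user, content) combination can be stored only once, which matches the "unique combination" requirement in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE user (
        user_id   INTEGER PRIMARY KEY,
        user_name TEXT NOT NULL
    );
    CREATE TABLE content (
        content_id INTEGER PRIMARY KEY,
        content    TEXT NOT NULL
    );
    -- No surrogate id: the (user_id, content_id) pair is the key.
    CREATE TABLE user_content (
        user_id    INTEGER NOT NULL REFERENCES user (user_id),
        content_id INTEGER NOT NULL REFERENCES content (content_id),
        PRIMARY KEY (user_id, content_id)
    );
    INSERT INTO user    VALUES (1, 'John'), (2, 'Doe');
    INSERT INTO content VALUES (1, 'Content1'), (2, 'Content2');
""")


def add_pair(conn, user_id, content_id):
    """Insert one association; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO user_content (user_id, content_id) VALUES (?, ?)",
                (user_id, content_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised for a duplicate pair or for an unknown user/content id.
        return False


def pair_count(conn):
    """Number of associations currently stored."""
    return conn.execute("SELECT COUNT(*) FROM user_content").fetchone()[0]
```

Calling add_pair(conn, 1, 1) twice should store the pair once and reject the second attempt, so the daily process cannot hand the same content to the same user twice.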