Mysql – How to use and optimize subquery on 100 million rows

MySQLperformancequery-performancesubquery

What I'm trying to do is running a job on more than 100 million domains which haven't processed before.

I have two tables, "domain" and "domain_setting", on every batch (10.000 domains per batch) I'm getting a list of domains from domain table by checking their status on "domain_setting". At first it was working fine but now approximately 10 million domain processed and it's getting slower.

I know that my server is crappy for this kind of database, it has only 8 GB of memory and it's also shared with Apache web server (150 concurrent connections), some other small databases and some lightweight PHP applications. My innodb_buffer_pool_size is only 1 GB.

Anyway, I would like to learn if there is a better way or thoughts to write this query more efficiently or any other approach for doing this job and speed up the process. Here is my SQL query:

SELECT * FROM domain
WHERE domain_ID NOT IN (SELECT domain_ID FROM domain_setting) OR domain_ID IN (SELECT domain_ID FROM domain_setting WHERE is_keyword_checked = 0)
LIMIT 0,10000;

I must get domains which are not exist in domain_setting table OR has "is_keyword_checked = 0" value in domain_setting table. How can I optimize this query?

Thanks in advance 🙂

EDIT:
I can rewrite my query with LEFT JOIN as

SELECT *
FROM domain
LEFT JOIN domain_setting ON domain_setting.domain_ID = domain.domain_ID
WHERE domain_setting.is_keyword_checked = 0 OR domain_setting.is_keyword_checked IS NULL
ORDER BY domain.domain_ID ASC
LIMIT 0,10000;

but the main problem here seems full table scan. I can run it backwards by ORDER by domain.domain_ID DESC for recent added records and I can get them very fast. But still needs to scan full table if you didn't run it for a while and if recent records became more than millions of rows. So, is there any way to avoid full table scan for such work?

Best Answer

Since is_keyword_checked can take only two possible values, 0 or 1 (and it's not nullable), you can try the rewriting. There is no OR and the index on (is_keyword_checked) will be used:

SELECT * 
FROM domain
WHERE domain_ID NOT IN 
      ( SELECT domain_ID 
        FROM domain_setting 
        WHERE is_keyword_checked = 1
      ) ;

or with a NOT EXISTS subquery - which should have similar execution plan:

SELECT d.* 
FROM domain AS d
WHERE NOT EXISTS 
      ( SELECT * 
        FROM domain_setting AS ds
        WHERE ds.is_keyword_checked = 1
          AND ds.domain_ID = d.domain_ID
      ) ;