MySQL – How to Speed Up SQL NOT IN Query

MySQL

I have a query as part of my process, that I execute hundreds of times in a loop.

Initially, Table A contains all records (20mil). Table B contains 0 records.
Primary key in both tables ID
the I execute:

select * from table A where a.ID not in (select ID from table b) limit 10000
##magic stuff in python
insert everything to table B, once again, .

Initially, the query runs super fast, but after Nth loop (100th+), size of table B increases, to the point where I it takes a bit of time to perform the NOT IN operation.

Does anyone have recommendations on how I can speed up the query?
– So far, I've tweaked the default mysql bugger to be 1.5gbs (ids are pretty small INTs, so that should be enough).

Caveats:

1) One way to do this would be to remove * from table A after I've processed them. However, I want to keep table A in tact.

… only method I could think of I adding another column to table A (which I'd index) called 'PROCESSED'… then update that column with a second query once the records have been processed/posted, but I was hoping there was an easier solution.

Thank you all in advance.

Best Answer

If your goal is to look at every row in A and do something with it, there is a much more efficient way. (It seems that B is merely there to see what you have already processed.)

The reason for it getting slower is that it has to do more work as it gets farther into A -- namely skipping over the rows it has processed. A processed flag might suffer the same malady.

So...

Walk through A processing chunks as you go. Then remember where you left off so that the next 10000 will be right there, no searching. I discuss that in more detail with an eye to DELETEing, but it can be adapted for other purposes: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks With that, B is unnecessary.

Partitioning

If you are thinking about partitioning the data by months, I have to ask you "Why?". Here are some answers: