I have a query as part of my process, that I execute hundreds of times in a loop.
Initially, Table A contains all records (20mil). Table B contains 0 records.
Primary key in both tables ID
the I execute:
select * from table A where a.ID not in (select ID from table b) limit 10000
##magic stuff in python
insert everything to table B, once again, .
Initially, the query runs super fast, but after Nth loop (100th+), size of table B increases, to the point where I it takes a bit of time to perform the NOT IN operation.
Does anyone have recommendations on how I can speed up the query?
– So far, I've tweaked the default mysql bugger to be 1.5gbs (ids are pretty small INTs, so that should be enough).
Caveats:
1) One way to do this would be to remove * from table A after I've processed them. However, I want to keep table A in tact.
… only method I could think of I adding another column to table A (which I'd index) called 'PROCESSED'… then update that column with a second query once the records have been processed/posted, but I was hoping there was an easier solution.
Thank you all in advance.
Best Answer
If your goal is to look at every row in
A
and do something with it, there is a much more efficient way. (It seems thatB
is merely there to see what you have already processed.)The reason for it getting slower is that it has to do more work as it gets farther into
A
-- namely skipping over the rows it has processed. Aprocessed
flag might suffer the same malady.So...
Walk through
A
processing chunks as you go. Then remember where you left off so that the next 10000 will be right there, no searching. I discuss that in more detail with an eye toDELETEing
, but it can be adapted for other purposes: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks With that,B
is unnecessary.Partitioning
If you are thinking about partitioning the data by months, I have to ask you "Why?". Here are some answers:
PARTITION BY RANGE(...)
and include the year, too. More: http://mysql.rjweb.org/doc.php/partitionmaint