I'm not looking for alternative ways to do the updates. Having things
like a tmp table [will] lock all the rows that are being updated until
they all finish (which could be hours), which won't work for me. They
MUST be in these awful loops.
I disagree.
The strength of an RDBMS is in performing set operations like "update all these rows plz". Given this, your intuition should tell you that these "awful loops" are not the best way to go except under very rare circumstances.
Let's take a look at your current update logic and understand what it's doing.
First off, the `SET autocommit = 0` line in your script is unnecessary. Because you explicitly open a transaction immediately after it with `START TRANSACTION`, autocommit is automatically disabled until you end the transaction with `COMMIT` or `ROLLBACK`.
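To make the redundancy concrete, here is a minimal sketch of the pattern (the table and update are placeholders, not taken from your script):

```sql
SET autocommit = 0;   -- redundant: the next line suspends autocommit anyway
START TRANSACTION;    -- autocommit is implicitly disabled until COMMIT/ROLLBACK

UPDATE my_table SET my_col = 'x' WHERE id = 1;  -- hypothetical update

COMMIT;               -- normal autocommit behavior resumes here
```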
Now for the meat of the logic: you've wrapped all the individual updates inside the loop in one big transaction. If your intention behind the iterative updates was to reduce locking and increase concurrency, the wrapping transaction defeats that intention. MySQL must hold a lock on every row it updates until the transaction commits, so that it can roll them all back at once if the transaction fails or is cancelled. Furthermore, instead of knowing in advance that it is about to lock this range of rows (which would let MySQL issue locks at the appropriate granularity), the engine is forced to issue a large number of row-level locks in rapid succession. Given that you are updating 1 million rows, this is a massive burden on the engine.
I propose two solutions:
1. Turn `autocommit` on and remove the transaction wrapper. MySQL will then be able to release each row lock as soon as it finishes updating that row. It is still forced to issue and release a massive number of locks in a short period of time, so I doubt this will be an appropriate fix for you. Furthermore, if an error occurs halfway through the loop, nothing will be rolled back, since the work is not transaction-bound.
2. Batch your updates through a temp table. You mentioned and then dismissed this solution, but I bet it will perform best. Have you already tried it? I would first test the full million-row update. If that takes too long, batch the work into progressively smaller chunks until you've found the sweet spot: batches big enough to get the total work done quickly, but small enough that no individual batch blocks other processes for too long. This is a common technique DBAs use when they have to modify a large number of rows during live operations. Remember, since your goal is to maximize concurrency, keep `autocommit` on and don't wrap any of this work in one massive transaction, so that MySQL releases its locks as soon as possible.
Notice that as the batches become progressively smaller, this solution eventually approximates the first one. That is why I am confident this solution will perform better: When the database engine can group its work into chunks, it flies.
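A minimal sketch of the batched approach (all table, column, and predicate names below are placeholders for illustration, not taken from your script):

```sql
-- Stage the keys of the rows to update; a temporary table is private
-- to this session, so populating it does not block other connections.
CREATE TEMPORARY TABLE batch_keys (
    hash VARCHAR(5) NOT NULL,
    PRIMARY KEY (hash)
);

INSERT INTO batch_keys (hash)
SELECT hash FROM my_table WHERE needs_update = 1;  -- hypothetical predicate

-- With autocommit ON, each statement commits (and releases its locks)
-- as soon as it finishes. Repeat this pair until batch_keys is empty;
-- 10000 is just a starting batch size to tune.
UPDATE my_table t
JOIN (SELECT hash FROM batch_keys ORDER BY hash LIMIT 10000) b USING (hash)
SET t.my_col = 'new value';                        -- hypothetical assignment

DELETE FROM batch_keys ORDER BY hash LIMIT 10000;  -- drop the processed batch
```

Shrinking the `LIMIT` trades total throughput for shorter lock hold times per batch, which is exactly the knob you tune to find the sweet spot.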
I think the issue is due to the choice of the clustered index. From MySQL docs, Clustered and Secondary Indexes:
Every InnoDB table has a special index called the clustered index where the data for the rows is stored. Typically, the clustered index is synonymous with the primary key. To get the best performance from queries, inserts, and other database operations, you must understand how InnoDB uses the clustered index to optimize the most common lookup and DML operations for each table.
Also check the answer by @marc_s in this SO question: How to choose the clustered index in SQL Server?, where he mentions:
According to The Queen Of Indexing, Kimberly Tripp, what she looks for in a clustered index is primarily:

- Unique
- Narrow
- Static

And if you can also guarantee:

- Ever-increasing

then you're pretty close to having your ideal clustering key!
Now, your clustered index (the primary key) is:

hash varchar(5) CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL,

which (let's go through the checklist) is:
- Unique (yes, OK)
- Narrow (yes, OK)
- Static (perhaps; you would know that)
but is probably not:
- Ever-increasing pattern (No, it probably isn't)
So, what happens when you use a non-ever-increasing clustered index?
I can't answer better than Kimberly L. Tripp herself: Ever-increasing clustering key - the Clustered Index Debate..........again!
If the clustering key is ever-increasing then new rows have a specific location where they can be placed. If that location is at the end of the table then the new row needs space allocated to it but it doesn't have to make space in the middle of the table. If a row is inserted to a location that doesn't have any room then room needs to be made (e.g. if you insert based on last name, then as rows come in, space will need to be made where that name should be placed). If room needs to be made, it's made by SQL Server doing something called a split. Splits in SQL Server are 50/50 splits - simply put - 50% of the data stays and 50% of the data is moved. This keeps the index logically intact (the lowest level of an index - called the leaf level - is a doubly-linked list) but not physically intact. When an index has a lot of splits then the index is said to be fragmented. Good examples of an index that is ever-increasing are IDENTITY columns (and they're also naturally unique, naturally static and naturally narrow) or something that follows as many of these things as possible - like a datetime column (or, since that's NOT very likely to be unique by itself, a composite key of datetime, identity).
Note that despite the mention of SQL Server, the same concept applies to InnoDB clustered indexes as well. I suspect that your clustered index causes two issues:
- When you insert a new row (and the "random" hash guarantees this), it gets inserted at a random location in the index. This means it will sometimes find no space available at that location (InnoDB leaves some space free in each index page, but eventually that free space fills up), so some rearrangement of the index has to happen - and that takes time.
- Over time, this rearrangement also causes fragmentation of the index, which eventually makes other queries and statements slower.
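If changing the schema is an option, the usual remedy is to cluster on an ever-increasing surrogate key and demote the hash to a unique secondary index. A sketch (assuming a simplified version of your table; column names beyond `hash` are placeholders):

```sql
CREATE TABLE my_table (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    hash VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL,
    -- ... your other columns ...
    PRIMARY KEY (id),          -- narrow, unique, static, ever-increasing
    UNIQUE KEY uk_hash (hash)  -- preserves the uniqueness guarantee on hash
) ENGINE=InnoDB;
```

The trade-off is that lookups by `hash` now traverse the secondary index and then the clustered index, but inserts append to the end of the clustered index instead of splitting pages in the middle.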
Best Answer
Not sure if you found a solution to this but for us it seems to be related to the size of data leaving the VPC.
If you run a query with a small result set (for example, one narrowed with a LIKE filter), the result returns fine; but if you take away the LIKE and get a large result set back, it fails. We're still investigating the problem.