I'm not looking for alternative ways to do the updates. Having things
like a tmp table [will] lock all the rows that are being updated until
they all finish (which could be hours), which won't work for me. They
MUST be in these awful loops.
I disagree.
The strength of an RDBMS is in performing set operations like "update all these rows plz". Given this, your intuition should tell you that these "awful loops" are not the best way to go except under very rare circumstances.
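To make that concrete, the set-based form of a mass update is a single statement. The table and column names here are placeholders, not taken from your schema:

UPDATE big_table
SET    value = value + 1        -- whatever the per-row change is
WHERE  needs_update = 1;        -- the whole million-row set, in one statement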
Let's take a look at your current update logic and understand what it's doing.
First off, the `set autocommit=0` line in your script is unnecessary. Because you explicitly open a transaction immediately after it with `start transaction`, `autocommit` automatically becomes disabled until you end the transaction with `COMMIT` or `ROLLBACK`.
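In other words, the explicit transaction already gives you the behaviour that `set autocommit=0` was meant to provide:

START TRANSACTION;   -- autocommit is implicitly suspended from here on
-- ... the individual UPDATEs from your loop run here ...
COMMIT;              -- (or ROLLBACK) ends the transaction and restores autocommit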
Now for the meat of the logic: You've wrapped all these individual updates inside the loop in one big transaction. If your intention behind the iterative updates was to reduce locking and increase concurrency, the wrapped transaction defeats that intention. MySQL must maintain locks on every row it updates until the transaction commits so it can roll them all back at once if the transaction fails or is cancelled. Furthermore, instead of knowing in advance that it is about to lock this range of rows (which would enable MySQL to issue locks with the appropriate granularity), the engine is forced to issue a large number of row-level locks in rapid succession. Given that you are updating 1 million rows, this is a massive burden on the engine.
I propose two solutions:
- Turn `autocommit` on and remove the transaction wrapper. MySQL will then be able to release each row lock right after it finishes updating that row. It is still forced to issue and release a massive number of locks in a short period of time, so I doubt this will be an appropriate fix for you. Furthermore, if some error occurs halfway through the loop, nothing will be rolled back, since the work is not transaction-bound.
- Batch your updates in a temp table (a rough sketch follows below). You mentioned and then dismissed this solution, but I bet it will perform best. Have you already tried it? I would first test the full million-row update. If that takes too long, then batch the work into progressively smaller chunks until you've found the sweet spot: the batches are big enough to get the total work done quickly, but no individual batch blocks other processes for too long. This is a common technique DBAs use when they have to modify a large number of rows during live operations. Remember, since your goal is to maximize your concurrency, keep `autocommit` on and don't wrap any of this work into a massive transaction, so MySQL releases its locks as soon as possible.
Notice that as the batches become progressively smaller, this solution eventually approximates the first one. That is why I am confident this solution will perform better: When the database engine can group its work into chunks, it flies.
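Here is a rough sketch of the batched approach. Treat it as a pattern, not a drop-in script: the names (big_table, pending_updates, id, value, needs_update) are placeholders, not from your schema.

SET autocommit = 1;                          -- each statement commits on its own

-- 1. Stage the new values once, outside any long-running transaction.
CREATE TABLE pending_updates (
    id        INT NOT NULL PRIMARY KEY,
    new_value INT NOT NULL
);

INSERT INTO pending_updates (id, new_value)
SELECT id, value + 1                         -- stand-in for your real per-row calculation
FROM   big_table
WHERE  needs_update = 1;

-- 2. Apply the staged values in chunks. Run this statement repeatedly
--    (from the application or a stored routine), advancing the id range
--    each pass, until every staged row has been applied.
UPDATE big_table b
JOIN   pending_updates p ON p.id = b.id
SET    b.value = p.new_value
WHERE  p.id BETWEEN 1 AND 10000;             -- next pass: 10001 to 20000, and so on

Start with a large range (or no range at all) and shrink it until each batch runs fast enough not to disturb the rest of your workload.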
There are basically three options with your scenario, as I understand it.
- Dynamic SQL (see a_vlad's answer)
- A (potentially) insanely massive JOIN.
- Return "canned" data.
A massive JOIN
For option two, you have to know the names of your tables when you write the query, and those tables have to exist.
So, step one would be to determine any future tables that will be needed, and pre-create them. If you can't do this, you can still build the query, but it will have to be modified whenever a new table is added.
So, assuming that your criteria allow us to create as many tables as you'll need for the foreseeable future in advance, next you have to build the new query:
SELECT DATE(COALESCE(t1.timestamp, t2.timestamp, t3.timestamp, t4.timestamp)) AS Date
      ,AVG(COALESCE(t1.Value, t2.Value, t3.Value, t4.Value)) AS Value
FROM cmsdb.usertable ut
LEFT JOIN cmsdb.grouptable gt ON ut.groupID = gt.ID
LEFT JOIN Table1 t1 ON (gt.tablename = 'Table1')
LEFT JOIN Table2 t2 ON (gt.tablename = 'Table2')
LEFT JOIN Table3 t3 ON (gt.tablename = 'Table3')
LEFT JOIN Table4 t4 ON (gt.tablename = 'Table4')
WHERE ut.User = 2
AND COALESCE(t1.timestamp, t2.timestamp, t3.timestamp, t4.timestamp) BETWEEN CURDATE() - INTERVAL 4 WEEK AND CURDATE()
GROUP BY DATE(COALESCE(t1.timestamp, t2.timestamp, t3.timestamp, t4.timestamp))
As you can see, this is already pretty unwieldy with 4 tables - with 200, it'd be hideous.
NOTE: code is untested, and I don't mainly work with MySQL, but this is simple enough that it should work; you might need to play with exactly where the `COALESCE` function should be.
The key here, if it isn't obvious, is that we only actually pay attention to data from the table with the right name.
There are a couple of assumptions here:
- Only one table can be valid for any given user. If a user can belong to multiple groups and thus see multiple tables, the results may not be correct.
- This assumes that you really want a summary of all the table's contents in the date range for any user in the correct group; if there's anything user-specific, you'd need to add that check, using `COALESCE` again to pick up the column from the table we actually joined with (a hypothetical example follows this list).
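For instance, if each of the data tables had a UserID column (an assumption; your tables may identify users differently), the extra check could be added to the WHERE clause like so:

AND COALESCE(t1.UserID, t2.UserID, t3.UserID, t4.UserID) = ut.User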
In many ways, a dynamic SQL solution would be easier on the programmer. With it, you can avoid some of the possible problems with this approach (it can't handle new tables, renamed tables, etc.).
Canned data
The third option only works if the data returned does not have to be live.
For this option, you create another table - let's call it GroupSummary. This table has three key columns: `grouptable` (which can be used as the key), `SummaryDate`, and `AvgValue`.
GroupSummary would be populated by a dynamic SQL statement, like in a_vlad's answer. You could loop through the possible tables with a cursor, and have the whole thing set up as a stored program; a rough sketch follows below. Run the stored program as often as needed to pick up new data (noting that running it too frequently is likely to cause performance issues with your `SELECT` query). If reasonably fresh data is required, I'd think about refreshing the data every 5-15 minutes; at a minimum, it should be refreshed after midnight (since that, of course, changes the base date it works from).
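Such a stored program might look roughly like the following. The `tablename` column and the `timestamp`/`Value` columns are taken from the query earlier in this answer; the procedure name and the assumption that every per-group table shares those column names are mine, so adjust to your actual schema.

DELIMITER //
CREATE PROCEDURE refresh_group_summary()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE tbl VARCHAR(64);
    DECLARE cur CURSOR FOR SELECT tablename FROM grouptable;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    TRUNCATE TABLE GroupSummary;

    OPEN cur;
    table_loop: LOOP
        FETCH cur INTO tbl;
        IF done THEN LEAVE table_loop; END IF;

        -- Build and run one INSERT per group table via dynamic SQL
        SET @sql = CONCAT(
            'INSERT INTO GroupSummary (grouptable, SummaryDate, AvgValue) ',
            'SELECT ''', tbl, ''', DATE(timestamp), AVG(Value) FROM ', tbl,
            ' WHERE timestamp BETWEEN CURDATE() - INTERVAL 4 WEEK AND CURDATE()',
            ' GROUP BY DATE(timestamp)');
        PREPARE stmt FROM @sql;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;
    END LOOP;
    CLOSE cur;
END //
DELIMITER ;

You could then schedule it with the MySQL event scheduler or a cron job at whatever interval keeps the data fresh enough.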
Then, the query in your webpage would be:
SELECT SummaryDate AS Date
      ,AvgValue AS Value
FROM GroupSummary gs
INNER JOIN grouptable gt ON (gs.grouptable = gt.tablename)
INNER JOIN usertable ut ON (gt.ID = ut.groupID)
WHERE ut.User = %USERID%
(Note: `WHERE` clause based on your sample query; adjust as needed.)
Best Answer
This sort of thing can be done fairly elegantly using window functions. While I was researching this answer, I found that MySQL doesn't support them. The code that follows will likely have some performance problems, since it uses nested `GROUP BY` select statements. Proceed with caution.
The case above only works for one day due to the `WHERE` clause in lines 12 and 13. If you need to do this for more than one day at a time, you should add a function which returns the date to the inner-most sub-query, then carry that date through the `Join_Back_User` subquery to the outermost query...