I'm not looking for alternative ways to do the updates. Having things
like a tmp table [will] lock all the rows that are being updated until
they all finish (which could be hours), which won't work for me. They
MUST be in these awful loops.
I disagree.
The strength of an RDBMS is in performing set operations like "update all these rows plz". Given this, your intuition should tell you that these "awful loops" are not the best way to go except under very rare circumstances.
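To make that concrete, the set-based form of a mass update is a single statement. The table and column names here are placeholders, not taken from your schema:

UPDATE big_table
SET    value = value + 1        -- whatever the per-row change is
WHERE  needs_update = 1;        -- the whole million-row set, in one statement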
Let's take a look at your current update logic and understand what it's doing.
First off, the `set autocommit=0` line in your script is unnecessary. Because you explicitly open a transaction immediately after it with `start transaction`, `autocommit` automatically becomes disabled until you end the transaction with `COMMIT` or `ROLLBACK`.
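In other words, the explicit transaction already gives you the behaviour that `set autocommit=0` was meant to provide:

START TRANSACTION;   -- autocommit is implicitly suspended from here on
-- ... the individual UPDATEs from your loop run here ...
COMMIT;              -- (or ROLLBACK) ends the transaction and restores autocommit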
Now for the meat of the logic: You've wrapped all these individual updates inside the loop in one big transaction. If your intention behind the iterative updates was to reduce locking and increase concurrency, the wrapped transaction defeats that intention. MySQL must maintain locks on every row it updates until the transaction commits so it can roll them all back at once if the transaction fails or is cancelled. Furthermore, instead of knowing in advance that it is about to lock this range of rows (which would enable MySQL to issue locks with the appropriate granularity), the engine is forced to issue a large number of row-level locks in rapid succession. Given that you are updating 1 million rows, this is a massive burden on the engine.
I propose two solutions:
- Turn `autocommit` on and remove the transaction wrapper. MySQL will then be able to release each row lock right after it finishes updating that row. It is still forced to issue and release a massive number of locks in a short period of time, so I doubt this will be an appropriate fix for you. Furthermore, if some error occurs halfway through the loop, nothing will be rolled back, since the work is not transaction-bound.
- Batch your updates in a temp table (a rough sketch follows below). You mentioned and then dismissed this solution, but I bet it will perform best. Have you already tried it? I would first test the full million-row update. If that takes too long, then batch the work into progressively smaller chunks until you've found the sweet spot: the batches are big enough to get the total work done quickly, but no individual batch blocks other processes for too long. This is a common technique DBAs use when they have to modify a large number of rows during live operations. Remember, since your goal is to maximize your concurrency, keep `autocommit` on and don't wrap any of this work into a massive transaction, so MySQL releases its locks as soon as possible.
Notice that as the batches become progressively smaller, this solution eventually approximates the first one. That is why I am confident this solution will perform better: When the database engine can group its work into chunks, it flies.
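Here is a rough sketch of the batched approach. Treat it as a pattern, not a drop-in script: the names (big_table, pending_updates, id, value, needs_update) are placeholders, not from your schema.

SET autocommit = 1;                          -- each statement commits on its own

-- 1. Stage the new values once, outside any long-running transaction.
CREATE TABLE pending_updates (
    id        INT NOT NULL PRIMARY KEY,
    new_value INT NOT NULL
);

INSERT INTO pending_updates (id, new_value)
SELECT id, value + 1                         -- stand-in for your real per-row calculation
FROM   big_table
WHERE  needs_update = 1;

-- 2. Apply the staged values in chunks. Run this statement repeatedly
--    (from the application or a stored routine), advancing the id range
--    each pass, until every staged row has been applied.
UPDATE big_table b
JOIN   pending_updates p ON p.id = b.id
SET    b.value = p.new_value
WHERE  p.id BETWEEN 1 AND 10000;             -- next pass: 10001 to 20000, and so on

Start with a large range (or no range at all) and shrink it until each batch runs fast enough not to disturb the rest of your workload.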
There are basically three options with your scenario, as I understand it.
- Dynamic SQL (see a_vlad's answer)
- A (potentially) insanely massive JOIN.
- Return "canned" data.
A massive JOIN
For option two, you have to know the names of your tables when you write the query, and those tables have to exist.
So, step one would be to determine any future tables that will be needed, and pre-create them. If you can't do this, you can still build the query, but it will have to be modified whenever a new table is added.
So, assuming that your criteria allow us to create as many tables as you'll need for the foreseeable future in advance, next you have to build the new query:
SELECT DATE(COALESCE(t1.timestamp, t2.timestamp, t3.timestamp, t4.timestamp)) AS Date
      ,AVG(COALESCE(t1.Value, t2.Value, t3.Value, t4.Value)) AS Value
FROM cmsdb.usertable ut
LEFT JOIN cmsdb.grouptable gt ON ut.groupID = gt.ID
LEFT JOIN Table1 t1 ON (gt.tablename = 'Table1')
LEFT JOIN Table2 t2 ON (gt.tablename = 'Table2')
LEFT JOIN Table3 t3 ON (gt.tablename = 'Table3')
LEFT JOIN Table4 t4 ON (gt.tablename = 'Table4')
WHERE ut.User = 2
AND COALESCE(t1.timestamp, t2.timestamp, t3.timestamp, t4.timestamp) BETWEEN CURDATE() - INTERVAL 4 WEEK AND CURDATE()
GROUP BY DATE(COALESCE(t1.timestamp, t2.timestamp, t3.timestamp, t4.timestamp))
As you can see, this is already pretty unwieldy with 4 tables - with 200, it'd be hideous.
NOTE: code is untested, and I don't mainly work with MySQL, but this is simple enough that it should work; you might need to play with exactly where the `COALESCE` function should be.
The key here, if it isn't obvious, is that we only actually pay attention to data from the table with the right name.
There are a couple of assumptions here:
- Only one table can be valid for any given user. If a user can belong to multiple groups and thus see multiple tables, the results may not be correct.
- This assumes that you really want a summary of all the table's contents in the date range for any user in the correct group; if there's anything user-specific, you'd need to add that check, using `COALESCE` again to pick up the column from the table we actually joined with (a hypothetical example follows this list).
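For instance, if each of the data tables had a UserID column (an assumption; your tables may identify users differently), the extra check could be added to the WHERE clause like so:

AND COALESCE(t1.UserID, t2.UserID, t3.UserID, t4.UserID) = ut.User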
In many ways, a dynamic SQL solution would be easier on the programmer. With it, you can avoid some of the possible problems with this approach (it can't handle new tables, renamed tables, etc.).
Canned data
The third option only works if the data returned does not have to be live.
For this option, you create another table - let's call it GroupSummary. This table has three key columns: `grouptable` (which can be used as the key), `SummaryDate`, and `AvgValue`.
GroupSummary would be populated by a dynamic SQL statement, like in a_vlad's answer. You could loop through the possible tables with a cursor, and have the whole thing set up as a stored program; a rough sketch follows below. Run the stored program as often as needed to pick up new data (noting that running it too frequently is likely to cause performance issues with your `SELECT` query). If reasonably fresh data is required, I'd think about refreshing the data every 5-15 minutes; at a minimum, it should be refreshed after midnight (since that, of course, changes the base date it works from).
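Such a stored program might look roughly like the following. The `tablename` column and the `timestamp`/`Value` columns are taken from the query earlier in this answer; the procedure name and the assumption that every per-group table shares those column names are mine, so adjust to your actual schema.

DELIMITER //
CREATE PROCEDURE refresh_group_summary()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE tbl VARCHAR(64);
    DECLARE cur CURSOR FOR SELECT tablename FROM grouptable;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    TRUNCATE TABLE GroupSummary;

    OPEN cur;
    table_loop: LOOP
        FETCH cur INTO tbl;
        IF done THEN LEAVE table_loop; END IF;

        -- Build and run one INSERT per group table via dynamic SQL
        SET @sql = CONCAT(
            'INSERT INTO GroupSummary (grouptable, SummaryDate, AvgValue) ',
            'SELECT ''', tbl, ''', DATE(timestamp), AVG(Value) FROM ', tbl,
            ' WHERE timestamp BETWEEN CURDATE() - INTERVAL 4 WEEK AND CURDATE()',
            ' GROUP BY DATE(timestamp)');
        PREPARE stmt FROM @sql;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;
    END LOOP;
    CLOSE cur;
END //
DELIMITER ;

You could then schedule it with the MySQL event scheduler or a cron job at whatever interval keeps the data fresh enough.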
Then, the query in your webpage would be:
SELECT SummaryDate AS Date
      ,AvgValue AS Value
FROM GroupSummary gs
INNER JOIN grouptable gt ON (gs.grouptable = gt.tablename)
INNER JOIN usertable ut ON (gt.ID = ut.groupID)
WHERE ut.User = %USERID%
(Note: `WHERE` clause based on your sample query; adjust as needed.)
Best Answer
This sort of thing can be done fairly elegantly using window functions. While I was researching this answer, I found that MySQL doesn't support them. The code that follows will likely have some performance problems, since it uses nested `GROUP BY` select statements. Proceed with caution.
The case above only works for one day due to the `WHERE` clause in lines 12 and 13. If you need to do this for more than one day at a time, you should add a function which returns the date to the inner-most sub-query, then carry that date through the `Join_Back_User` subquery to the outermost query...