Mysql – Grouping similar rows together based on time

gaps-and-islandsMySQL

I have a table that records all the changes done to a model by a user. The table looks something like this: table structure (SQLFiddle)

In the table an action_hash is generated based on the model, model_id and timestamp, which effectively generates a unique identifier of an action.

I want to to display the data from this table to my users as an "activity feed". At first I was querying the database like so:

-- Get the last 5 actions (for pagination)
SELECT `action_hash` from `changes` GROUP BY `action_hash` ORDER BY `timestamp` DESC LIMIT 5;

-- Get all the rows for this last 5 action
SELECT * FROM `changes` WHERE `action_hash` IN ('a1a2913ac13c34afd957c1646243d8fa', 'af024d2e0b56c8af6506e8f755bb1e39', '33ccb3d33e4add95486a58dd5b33495a', 'b2fcb43f711ea6e047b4610334e546c8', '85f21b48085cce9ff7bab9c0b26dfb8f')

(SQLFiddle)
Result1

I would get a result like above. Then my backend code would go through the results in the last query and group the rows by hash, and parse them to be human readable while still maintaining the pagination.

The problem with this method is that some rows are so similar showing them as separate changes creates too much noise in the activity feed. The similar rows should somehow be grouped together.
result group together

The rows highlighted above should be grouped together uniquely identified in my final result (maybe with another hash?) because the changes occurring are happening to the same model (and model_id) within 15 minutes of each other. How can I write a query that shows this?

This is the result I'm looking for:
ideal results
Note: action_group doesn't have to be integers, any string that indicate they associate with one another will do. I'd also would like to paginate my results based on the new action_group column.

Best Answer

I would start with a query that finds the latest row in each group of similar rows that are less than 15 minutes apart, and picks the latest five of those:

SELECT DISTINCT model, model_id, timestamp 
FROM changes c1 WHERE (model, model_id) NOT IN
  (SELECT model, model_id FROM changes c2 
   WHERE c2.timestamp > c1.timestamp 
     AND c2.timestamp < c1.timestamp + INTERVAL 15 MINUTE)
ORDER BY timestamp DESC LIMIT 5;

To get a unique number for each group, you can use a variable:

SELECT model, model_id, timestamp, @num:=@num+1 grp 
FROM (<query#1>) last, (SELECT @num:=0) init;

Using this query as a derived table (subquery in FROM clause), you can find all rows that belong to these groups:

SELECT c3.*, groups.grp action_group 
FROM changes c3 JOIN (<query#2>) groups USING(model, model_id) 
WHERE c3.timestamp <= groups.timestamp 
  AND groups.timestamp < c3.timestamp + INTERVAL 15 MINUTE 
ORDER BY c3.timestamp DESC;

The full query then becomes:

SELECT c3.*, groups.grp action_group 
FROM changes c3 JOIN 
    (SELECT model, model_id, timestamp, @num:=@num+1 grp 
     FROM (SELECT DISTINCT model, model_id, timestamp 
           FROM changes c1 
           WHERE (model, model_id) NOT IN
               (SELECT model, model_id FROM changes c2 
                WHERE c2.timestamp > c1.timestamp 
                AND c2.timestamp < c1.timestamp + INTERVAL 15 MINUTE)
           ORDER BY timestamp DESC LIMIT 5) last, 
           (SELECT @num:=0) init
    ) groups USING(model, model_id) 
WHERE c3.timestamp <= groups.timestamp 
  AND groups.timestamp < c3.timestamp + INTERVAL 15 MINUTE 
ORDER BY c3.timestamp DESC;

If your changes table is very big, the above query might take some time. Indexes on (model, model_id) and timestamp might help. Also, it might be possible to rewrite the query to use JOIN instead of NOT IN.