Mysql – Grouping similar rows together based on time

gaps-and-islandsMySQL

I have a table that records all the changes done to a model by a user. The table looks something like this: (SQLFiddle)

In the table an action_hash is generated based on the model, model_id and timestamp, which effectively generates a unique identifier of an action.

I want to to display the data from this table to my users as an "activity feed". At first I was querying the database like so:

-- Get the last 5 actions (for pagination)
SELECT `action_hash` from `changes` GROUP BY `action_hash` ORDER BY `timestamp` DESC LIMIT 5;

-- Get all the rows for this last 5 action
SELECT * FROM `changes` WHERE `action_hash` IN ('a1a2913ac13c34afd957c1646243d8fa', 'af024d2e0b56c8af6506e8f755bb1e39', '33ccb3d33e4add95486a58dd5b33495a', 'b2fcb43f711ea6e047b4610334e546c8', '85f21b48085cce9ff7bab9c0b26dfb8f')

(SQLFiddle)

I would get a result like above. Then my backend code would go through the results in the last query and group the rows by hash, and parse them to be human readable while still maintaining the pagination.

The problem with this method is that some rows are so similar showing them as separate changes creates too much noise in the activity feed. The similar rows should somehow be grouped together.

The rows highlighted above should be ~~grouped together~~ uniquely identified in my final result (maybe with another hash?) because the changes occurring are happening to the same model (and model_id) within 15 minutes of each other. How can I write a query that shows this?

This is the result I'm looking for:

Note: action_group doesn't have to be integers, any string that indicate they associate with one another will do. I'd also would like to paginate my results based on the new action_group column.

Best Answer

I would start with a query that finds the latest row in each group of similar rows that are less than 15 minutes apart, and picks the latest five of those:

SELECT DISTINCT model, model_id, timestamp 
FROM changes c1 WHERE (model, model_id) NOT IN
  (SELECT model, model_id FROM changes c2 
   WHERE c2.timestamp > c1.timestamp 
     AND c2.timestamp < c1.timestamp + INTERVAL 15 MINUTE)
ORDER BY timestamp DESC LIMIT 5;

To get a unique number for each group, you can use a variable:

SELECT model, model_id, timestamp, @num:=@num+1 grp 
FROM (<query#1>) last, (SELECT @num:=0) init;

Using this query as a derived table (subquery in FROM clause), you can find all rows that belong to these groups:

SELECT c3.*, groups.grp action_group 
FROM changes c3 JOIN (<query#2>) groups USING(model, model_id) 
WHERE c3.timestamp <= groups.timestamp 
  AND groups.timestamp < c3.timestamp + INTERVAL 15 MINUTE 
ORDER BY c3.timestamp DESC;

The full query then becomes:

SELECT c3.*, groups.grp action_group 
FROM changes c3 JOIN 
    (SELECT model, model_id, timestamp, @num:=@num+1 grp 
     FROM (SELECT DISTINCT model, model_id, timestamp 
           FROM changes c1 
           WHERE (model, model_id) NOT IN
               (SELECT model, model_id FROM changes c2 
                WHERE c2.timestamp > c1.timestamp 
                AND c2.timestamp < c1.timestamp + INTERVAL 15 MINUTE)
           ORDER BY timestamp DESC LIMIT 5) last, 
           (SELECT @num:=0) init
    ) groups USING(model, model_id) 
WHERE c3.timestamp <= groups.timestamp 
  AND groups.timestamp < c3.timestamp + INTERVAL 15 MINUTE 
ORDER BY c3.timestamp DESC;

If your changes table is very big, the above query might take some time. Indexes on (model, model_id) and timestamp might help. Also, it might be possible to rewrite the query to use JOIN instead of NOT IN.

Related Solutions

Mysql – Is it possible to reduce a group to a row based on some criteria

The traditional solution, the one you may find in books, is to do a self join: first find that "max date per group" of yours, then join to self table on rows with said max date.

However, some hacks allow you to avoid that. Consider the following query:

SELECT
  MAX(date_added) AS date_added,
  SUBSTRING_INDEX(GROUP_CONCAT(some_column ORDER BY date_added DESC), ',', 1) AS some_column,
  SUBSTRING_INDEX(GROUP_CONCAT(another_column ORDER BY date_added DESC), ',', 1) AS another_column
FROM t
GROUP BY whatever

GROUP_CONCAT is an aggregation function which implodes values onto one string. It allows for ORDER BY, which we utilize via ORDER BY date_added DESC so as to implode our desired value first. We then slice up the first token in the string via SUBSTRING_INDEX.

The downside here (apart from making the query quite the frightening appearance) is that your numerical values are transformed into texts. Typically no big deal with SQL, but please be aware.

There's another option where you do a semi-self-join, a much lighter one; you will have to give up usage of index. It's quite long to describe; it still uses GROUP_CONCAT and SUBSTRING_INDEX, but only for the purpose of creating a derived table with only relevant keys. This derived table is then joined with original table. See an example in SQL: selecting top N records per group, another solution.

MySQL Grouping Order

You can use ORDER BY inside GROUP_CONCAT() function. You can also change the separator, if you don't want to use comma. Further details in MySQL documentation: GROUP_CONCAT()

Example:

SELECT 
    product_id,
    GROUP_CONCAT(name 
                 ORDER BY language_id
                 SEPARATOR ','
                ) AS product_names
FROM
    products_languages
GROUP BY
    product_id ;

Best Answer

Related Solutions

Mysql – Is it possible to reduce a group to a row based on some criteria

MySQL Grouping Order

Related Question