Mysql – Count and Group data based on consecutive occurrence

gaps-and-islandsMySQL

Tricky for me to define the problem, maybe you understand it through my dummy data.

I have this data:

PK, TaskPK
1, 1    
2, 1     
3, 2    
4, 2    
5, 5    
6, 1 
7, 1     
8, 2 
9, 2   
10, 5    
11, 5

Now I have to count TaskPK so, I make this query

Select PK, TaskPK, Count(*)
From tbl
Group by TaskPK

It brought this result

TaskPK, Count(*)
1, 4
2, 4
5, 3

But I want slight different result

Like this

TaskPK, Count(*)
1, 2
2, 2
5, 1
1, 2
2, 2
5, 2

The above result based on consecutive data occurrence, as TaskPK start with 1 (it group together), then it change it 2 (it group together), then 5 (it group together) taskPK. But as TaskPK again shifts to 1, then it should group seperatly not link with previous occurrence of 1, this task seperately count.

Is this possible?

Best Answer

Problems of this kind are a variation of what is normally referred to as Gaps and Islands. You can see a good overview at The SQL of Gaps and Islands in Sequences

Step by step approach

This can be done with several intermediate steps. First, let's make a (derived) table, that we will store as t1, finding out at which points you start a new series of TaskPK values.

CREATE TEMPORARY TABLE t1 AS
SELECT
    PK, TaskPK, 
    case when 
        (SELECT TaskPK 
         FROM tbl t_prv 
         WHERE t_prv.PK < tbl.PK 
         ORDER BY PK DESC 
         LIMIT 1
        ) <=> TaskPK 
    then 0
    else 1 
    end AS start_of_series
FROM
    tbl ;

t1 contains:

PK | TaskPK | start_of_series
-: | -----: | --------------:
 1 |      1 |               1
 2 |      1 |               0
 3 |      2 |               1
 4 |      2 |               0
 5 |      5 |               1
 6 |      1 |               1
 7 |      1 |               0
 8 |      2 |               1
 9 |      2 |               0
10 |      5 |               1
11 |      5 |               0

NOTE: the <=> operator in MySQL is equivalent to NOT DISTINCT FROM in standard SQL. It allows for equality comparison with NULL values as well as NON-NULL ones.

Now that we know where every series starts, we can give each PK a series_number by adding together all the current and previous values of the column start_of_series (that's a cumulative sum). Without window functions, this must be done with an (awful, inefficient) subquery.

We store the result in a table t2:

CREATE TEMPORARY TABLE t2 AS
SELECT
    PK, TaskPK, 
    (SELECT 
        sum(start_of_series) 
    FROM 
        t1 
    WHERE 
        t1.PK <= tbl.PK
    ) AS series_number
FROM
    tbl 
;

The content of t2 is:

PK | TaskPK | series_number
-: | -----: | ------------:
 1 |      1 |             1
 2 |      1 |             1
 3 |      2 |             2
 4 |      2 |             2
 5 |      5 |             3
 6 |      1 |             4
 7 |      1 |             4
 8 |      2 |             5
 9 |      2 |             5
10 |      5 |             6
11 |      5 |             6

At this point, what you want to do is just GROUP BY series_number, and take the value of TaskPK (just min(TaskPK) will do this job), together with a count(*):

SELECT
    min(TaskPK) AS TaskPK, count(*) AS count
FROM
    t2
GROUP BY
    series_number
ORDER BY
    series_number

This is the result you actually wanted:

TaskPK | count
-----: | ----:
     1 |     2
     2 |     2
     5 |     1
     1 |     2
     2 |     2
     5 |     2

Putting everything together

By substituting t2 and t1 by their definitions, you can put everything together in a single query:

SELECT
    min(TaskPK) AS TaskPK, count(*) AS count
FROM
    (SELECT
        PK, TaskPK, 
        (SELECT 
            sum(start_of_series) 
        FROM 
            (SELECT
                PK, TaskPK, 
                case when 
                    (SELECT TaskPK FROM tbl t_prv WHERE t_prv.PK < tbl.PK ORDER BY PK DESC LIMIT 1) <=> TaskPK 
                then 0
                else 1 
                end AS start_of_series 
            FROM
                tbl
            ) AS t1 
        WHERE 
            t1.PK <= tbl.PK
        ) AS series_number
    FROM
        tbl
    ) AS t2
GROUP BY
    series_number
ORDER BY
    series_number
;
TaskPK | count
-----: | ----:
     1 |     2
     2 |     2
     5 |     1
     1 |     2
     2 |     2
     5 |     2

This query is quite complicated (there are five SELECT within it), and can probably be still simplified some.

You can find the whole code and data at dbfiddle here

This can be very hihgly simplified if you can use MariaDB's WINDOW functions, or MySQL 8.0 Window functions (not yet released for production as of now).