MySQL – How to Find Groups with Duplicate Versions

aggregategroup byMySQLmysql-5.7select

I want to find all the groups in a table that have duplicate group version.
A group can have multiple group versions.
Each group version can have multiple members.
A group 'version' is defined by grpid and changeDate.
A group consider a duplicate if ALL the members (userid, pct and hobby ) in one group version matches another group version within the same group.

https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=84eb81a1a71dcee9ad3d0bd91f56120a

The groups table:

id grpid changeDate userid pct hobby group version*
1 1 2020-01-01 1 1 1 1
2 1 2020-01-02 1 1 2 2
3 1 2020-01-03 1 1 1 3
4 2 2020-01-01 1 0.5 1 4
5 2 2020-01-01 2 0.5 2 4
6 2 2020-01-02 1 0.5 1 5
7 2 2020-01-02 2 0.5 3 5
8 3 2020-01-01 1 0.5 1 6
9 3 2020-01-01 2 0.5 2 6
10 3 2020-01-02 1 0.4 1 7
11 3 2020-01-02 2 0.6 2 7
12 4 2020-01-01 1 0.6 1 8
13 4 2020-01-01 2 0.4 2 8
14 4 2020-01-02 1 0.6 1 9
15 4 2020-01-02 2 0.4 2 9
16 5 2020-01-01 1 0.2 2 10
17 5 2020-01-01 2 0.5 1 10
18 5 2020-01-01 3 0.3 2 10
19 6 2020-01-01 1 0.3 2 11
20 6 2020-01-01 2 0.5 1 11
21 6 2020-01-01 3 0.2 2 11
22 6 2020-02-01 1 0.2 2 12
23 6 2020-02-01 2 0.5 1 12
24 6 2020-02-01 3 0.3 2 12
25 6 2020-03-01 1 0.3 2 13
26 6 2020-03-01 2 0.3 1 13
27 6 2020-03-01 3 0.4 2 13
28 7 2020-01-01 1 0.3 2 14
29 7 2020-01-01 2 0.5 1 14
30 7 2020-01-01 3 0.2 2 14
31 7 2020-02-01 1 0.3 2 15
32 7 2020-02-01 2 0.5 1 15
33 7 2020-02-01 3 0.2 2 15
34 7 2020-03-01 1 0.3 2 16
35 7 2020-03-01 2 0.3 1 16
36 7 2020-03-01 3 0.4 2 16
37 8 2020-02-01 1 0.3 1 17
38 8 2020-03-01 1 0.3 1 18
39 8 2020-03-01 3 0.4 2 18

*Unique group version number just for visualization.

Result should be:

grpid
1
4
7

Explanation:

  • grpid 1 – there are 3 group versions (of 1 member) – 1 and 3 duplicate because userid, pct and hobby equal
  • grpid 2 – there are 2 group versions (of 2 members) – not duplicate because hobby not equal between 5 and 7
  • grpid 3 – there are 2 group versions (of 2 members) – not duplicate because pct is different among all members
  • grpid 4 – there are 2 group versions (of 2 members) – all members are duplicate because userid, pct and userid equal
  • grpid 5 – there is only one group of 3 members – not duplicate
  • grpid 6 – there are 3 group versions (of 3 members) – not duplicate – the pct changed for each member in the group between versions
  • grpid 7 – there are 3 group versions (of 3 members) – duplicate because userid, pct and userid equal between 28-30 and 31-33
  • grpid 8 – there are 2 group versions one w/ one member and one with 2 members – not duplicate because there is another member in that group

I am using MySQL 5.7.

Best Answer

Hopefully this should provide an answser:

This implements "common" relational division:

SELECT DISTINCT grpid
FROM groups AS g
GROUP BY grpid, changeDate
HAVING NOT EXISTS
    ( SELECT 1
      FROM groups AS gi
      WHERE gi.grpid = g.grpid
        AND gi.changeDate = g.changeDate
        AND NOT EXISTS
            ( SELECT 1 
               FROM groups AS gk
               WHERE gk.grpid = gi.grpid
                 AND gk.changeDate <> gi.changeDate
                 AND gk.userid = gi.userid
                 AND gk.pct = gi.pct
                 AND gk.hobby = gi.hobby
            )
    ) ;

Result:

grid
   1
   4
   7

This implements exact relational division, which results in even more complicated code. Your case, where the groupings/versions are determined by two columns (grpid, changeDate) makes it look like even more complicated.

Tested in dbffdle.uk

Query 1:

SELECT DISTINCT g1.grpid
FROM 
  ( SELECT grpid, changeDate
    FROM groups AS g
    GROUP BY grpid, changeDate
  ) AS g1
  JOIN
  ( SELECT grpid, changeDate
    FROM groups AS g
    GROUP BY grpid, changeDate
  ) AS g2
  ON  g1.grpid = g2.grpid
  AND g1.changeDate < g2.changeDate
WHERE NOT EXISTS
    ( SELECT 1
      FROM groups AS gi
      WHERE gi.grpid = g1.grpid
        AND gi.changeDate = g1.changeDate
        AND NOT EXISTS
            ( SELECT 1 
               FROM groups AS gk
               WHERE gk.grpid = g2.grpid
                 AND gk.changeDate = g2.changeDate
                 AND gk.userid = gi.userid
                 AND gk.pct = gi.pct
                 AND gk.hobby = gi.hobby
            )
    )
   AND NOT EXISTS
    ( SELECT 1
      FROM groups AS gi
      WHERE gi.grpid = g2.grpid
        AND gi.changeDate = g2.changeDate
        AND NOT EXISTS
            ( SELECT 1 
               FROM groups AS gk
               WHERE gk.grpid = g1.grpid
                 AND gk.changeDate = g1.changeDate
                 AND gk.userid = gi.userid
                 AND gk.pct = gi.pct
                 AND gk.hobby = gi.hobby
            )
    )
 ;

Query 2 that uses MySQL's GROUP_CONCAT function:

SELECT DISTINCT grpid
FROM
  ( SELECT grpid, changeDate, 
           GROUP_CONCAT( CONCAT_WS('-', userid, hobby, pct)
                         ORDER BY userid, hobby, pct
                         SEPARATOR '  '
                       ) AS groupdata 
    FROM groups AS gr
    GROUP BY grpid, changeDate
  ) AS g
GROUP BY grpid, groupdata
HAVING COUNT(*) > 1 ;