Mysql – How to efficiently eliminate data from select when rows are different by one column value

join;MySQLselect

I have such TABLE with data:

A  | B  | C  |  D | E 
-----------------------
a1 | b1 | c1 | d1 | PA
a1 | b1 | c1 | d1 | PR
a2 | b2 | c2 | d2 | PA
a3 | b3 | c3 | d3 | PA
a3 | b3 | c3 | d3 | PR
  • row1 has its counterpart row2 (they are only different by col E)
  • row3 has no counterpart – unique
  • row4 has its counterpart row5 ( same as first statement – col E )

the query should gave me a result like this:

A  | B  | C  |  D | E 
-----------------------
a2 | b2 | c2 | d2 | PA

only unique row which has no counterpart –
no row with same set of data only which different only by value in column E

  • we can apply here a few approaches which one will be most effective if we have more rows with its counterparts as unique rows (without counterpart)
  • which join use ?
  • which one created select statement subtract or add ?

ps. an extended case

A  | B    | C  |  D                  | E 
----------------------------------------
a1 | b1   | c1 | 2017-05-16 11:46:36 | PA
a1 | b1   | c1 | 2017-05-16 11:46:37 | PR
a2 | b2   | c2 | 2017-05-17 01:34:28 | PA
a3 | null | c3 | 2017-05-12 19:14:15 | PA    
a3 | b3   | c3 | 2017-05-12 19:14:15 | PR   
a4 | b4   | c4 | 2017-05-12 19:16:15 | PA  
a4 | b4   | c4 | 2017-05-12 23:16:15 | PR
  • column A -D may contain null values (row4)
  • column D is a TIMESTAMP and it could vary from its counterpart by 1-2minutes (row1, row2)
  • we add the max difference in timestamp = example 2 minutes (row6 has no counterpart as row7 vs row6 timestamp difference is greeter than 2 minuts (4 hours)
  • we don't want any row with PR in our result

expected result:

A  | B    | C  |  D                  | E 
----------------------------------------
a2 | b2   | c2 | 2017-05-17 01:34:28 | PA
a4 | b4   | c4 | 2017-05-12 19:16:15 | PA  

Best Answer

I wouldn't use a JOIN. If you would like to look for rows that have a set of columns in common (in your case, all but one), I would just GROUP the rows by the common columns. Then, restrict which groups you want to return (using a HAVING clause):

SELECT
    A, B, C, D, min(E) AS E
FROM
    t
GROUP BY
    A, B, C, D
HAVING
    count(*) = 1 ;

The min(E) part is just a trick to be "SQL compliant". max(E), or first(E) (for the databases that have it) would do the job as well. As you're choosing just groups with a single row, the min value of a column within the group is the single value there is. In mySQL 5.6 and earlier, leaving E as a column would work as well; although this is not SQL standard and doesn't work in 5.7 (thanks to ypercubeᵀᴹ for pointing the fact that this doesn't work on 5.7).

Results are:

|  A |  B |  C |  D |  E |
|----|----|----|----|----|
| a2 | b2 | c2 | d2 | PA |

You can get everything at this SQL Fiddle


You can also take a different approach. Look for rows that literally comply with what you're looking for: that is, it does NOT exist another row with the same A, B, C and D and different E. In this case, use an EXISTS condition in your WHERE clause:

SELECT
    A, B, C, D, E
FROM
    t AS t0
WHERE
    NOT EXISTS 
    (SELECT 
         *
    FROM
        t AS t1
    WHERE
             t1.A  = t0.A
         AND t1.B  = t0.B
         AND t1.C  = t0.C
         AND t1.D  = t0.D
         AND t1.E <> t0.E
    ) ;

Older versions of MySQL would probably not handle this query too well, because EXISTS conditions are not well optimized. mySQL 5.7 does a decent job; and so do most other databases (Oracle, MS SQL Server, PostgreSQL, ...) . You do need an index on (A, B, C, D, E) (or, at least, (A, B, C, D) to be really performant.

SQL Fiddle


The two queries will not work the same in the case where you have two exact duplicate rows (which I interpret is a case not allowed by your spec). The first one will return no rows, whereas the second one will return both rows.


MySQL 5.6 Schema Setup:

CREATE TABLE t
(
    `A` varchar(2), 
    `B` varchar(2), 
    `C` varchar(2), 
    `D` varchar(2), 
    `E` varchar(2)
) ;

INSERT INTO t
    (`A`, `B`, `C`, `D`, `E`)
VALUES
    ('a1', 'b1', 'c1', 'd1', 'PA'),
    ('a1', 'b1', 'c1', 'd1', 'PR'),
    ('a2', 'b2', 'c2', 'd2', 'PA'),
    ('a3', 'b3', 'c3', 'd3', 'PA'),
    ('a3', 'b3', 'c3', 'd3', 'PR');

For your extended case, the second query is your starting point.

Change your equalities with whichever conditions meet your needs.

For instance, if your D columns is a timestamp, and you want to consider that "two times are equivalent if the difference is less than 2 minutes"... then change

t1.D = t0.D

by

abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120 -- seconds

The possibility of nulls (that work as "jokers") should be handled with:

(t1.A = t0.A or t0.A is null or t1.A is null)

If E is nullable, then the comparison is done with

NOT (t1.E <=> t0.E)

which is mySQL equivalent of (t1.E is distinct from t0.E)

This will give you:

 SELECT
     A, B, C, D, E
 FROM
     t AS t0
 WHERE
     NOT EXISTS 
     (SELECT 
          *
     FROM
         t AS t1
     WHERE
              (t1.A = t0.A or t1.A is NULL or t0.A is NULL)
          AND (t1.B = t0.B or t1.B is NULL or t0.B is NULL)
          AND (t1.C = t0.C or t1.C is NULL or t0.C is NULL)
          AND (abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120)
          AND NOT (t1.E <=> t0.E)
     ) 
     AND t0.E <> 'PR' ;

 A  | B  | C  | D                   | E 
 :- | :- | :- | :------------------ | :-
 a2 | b2 | c2 | 2017-05-17 01:34:28 | PA
 a4 | b4 | c4 | 2017-05-12 19:16:15 | PA
 

dbfiddle here

Don't expect this kind of queries to be efficient in realistic scenarios.