Mysql – How to efficiently eliminate data from select when rows are different by one column value

join;MySQLselect

I have such TABLE with data:

A  | B  | C  |  D | E 
-----------------------
a1 | b1 | c1 | d1 | PA
a1 | b1 | c1 | d1 | PR
a2 | b2 | c2 | d2 | PA
a3 | b3 | c3 | d3 | PA
a3 | b3 | c3 | d3 | PR

row1 has its counterpart row2 (they are only different by col E)
row3 has no counterpart – unique
row4 has its counterpart row5 ( same as first statement – col E )

the query should gave me a result like this:

A  | B  | C  |  D | E 
-----------------------
a2 | b2 | c2 | d2 | PA

only unique row which has no counterpart –
no row with same set of data only which different only by value in column E

we can apply here a few approaches which one will be most effective if we have more rows with its counterparts as unique rows (without counterpart)
which join use ?
which one created select statement subtract or add ?

ps. an extended case

A  | B    | C  |  D                  | E 
----------------------------------------
a1 | b1   | c1 | 2017-05-16 11:46:36 | PA
a1 | b1   | c1 | 2017-05-16 11:46:37 | PR
a2 | b2   | c2 | 2017-05-17 01:34:28 | PA
a3 | null | c3 | 2017-05-12 19:14:15 | PA    
a3 | b3   | c3 | 2017-05-12 19:14:15 | PR   
a4 | b4   | c4 | 2017-05-12 19:16:15 | PA  
a4 | b4   | c4 | 2017-05-12 23:16:15 | PR

column A -D may contain null values (row4)
column D is a TIMESTAMP and it could vary from its counterpart by 1-2minutes (row1, row2)
we add the max difference in timestamp = example 2 minutes (row6 has no counterpart as row7 vs row6 timestamp difference is greeter than 2 minuts (4 hours)
we don't want any row with PR in our result

expected result:

A  | B    | C  |  D                  | E 
----------------------------------------
a2 | b2   | c2 | 2017-05-17 01:34:28 | PA
a4 | b4   | c4 | 2017-05-12 19:16:15 | PA

Best Answer

I wouldn't use a JOIN. If you would like to look for rows that have a set of columns in common (in your case, all but one), I would just GROUP the rows by the common columns. Then, restrict which groups you want to return (using a HAVING clause):

SELECT
    A, B, C, D, min(E) AS E
FROM
    t
GROUP BY
    A, B, C, D
HAVING
    count(*) = 1 ;

The min(E) part is just a trick to be "SQL compliant". max(E), or first(E) (for the databases that have it) would do the job as well. As you're choosing just groups with a single row, the min value of a column within the group is the single value there is. In mySQL 5.6 and earlier, leaving E as a column would work as well; although this is not SQL standard and doesn't work in 5.7 (thanks to ypercubeᵀᴹ for pointing the fact that this doesn't work on 5.7).

Results are:

|  A |  B |  C |  D |  E |
|----|----|----|----|----|
| a2 | b2 | c2 | d2 | PA |

You can get everything at this SQL Fiddle

You can also take a different approach. Look for rows that literally comply with what you're looking for: that is, it does NOT exist another row with the same A, B, C and D and different E. In this case, use an EXISTS condition in your WHERE clause:

SELECT
    A, B, C, D, E
FROM
    t AS t0
WHERE
    NOT EXISTS 
    (SELECT 
         *
    FROM
        t AS t1
    WHERE
             t1.A  = t0.A
         AND t1.B  = t0.B
         AND t1.C  = t0.C
         AND t1.D  = t0.D
         AND t1.E <> t0.E
    ) ;

Older versions of MySQL would probably not handle this query too well, because EXISTS conditions are not well optimized. mySQL 5.7 does a decent job; and so do most other databases (Oracle, MS SQL Server, PostgreSQL, ...) . You do need an index on (A, B, C, D, E) (or, at least, (A, B, C, D) to be really performant.

SQL Fiddle

The two queries will not work the same in the case where you have two exact duplicate rows (which I interpret is a case not allowed by your spec). The first one will return no rows, whereas the second one will return both rows.

MySQL 5.6 Schema Setup:

CREATE TABLE t
(
    `A` varchar(2), 
    `B` varchar(2), 
    `C` varchar(2), 
    `D` varchar(2), 
    `E` varchar(2)
) ;

INSERT INTO t
    (`A`, `B`, `C`, `D`, `E`)
VALUES
    ('a1', 'b1', 'c1', 'd1', 'PA'),
    ('a1', 'b1', 'c1', 'd1', 'PR'),
    ('a2', 'b2', 'c2', 'd2', 'PA'),
    ('a3', 'b3', 'c3', 'd3', 'PA'),
    ('a3', 'b3', 'c3', 'd3', 'PR');

For your extended case, the second query is your starting point.

Change your equalities with whichever conditions meet your needs.

For instance, if your D columns is a timestamp, and you want to consider that "two times are equivalent if the difference is less than 2 minutes"... then change

t1.D = t0.D

abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120 -- seconds

The possibility of nulls (that work as "jokers") should be handled with:

(t1.A = t0.A or t0.A is null or t1.A is null)

If E is nullable, then the comparison is done with

NOT (t1.E <=> t0.E)

which is mySQL equivalent of (t1.E is distinct from t0.E)

This will give you:

 SELECT
     A, B, C, D, E
 FROM
     t AS t0
 WHERE
     NOT EXISTS 
     (SELECT 
          *
     FROM
         t AS t1
     WHERE
              (t1.A = t0.A or t1.A is NULL or t0.A is NULL)
          AND (t1.B = t0.B or t1.B is NULL or t0.B is NULL)
          AND (t1.C = t0.C or t1.C is NULL or t0.C is NULL)
          AND (abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120)
          AND NOT (t1.E <=> t0.E)
     ) 
     AND t0.E <> 'PR' ;

 A  | B  | C  | D                   | E 
 :- | :- | :- | :------------------ | :-
 a2 | b2 | c2 | 2017-05-17 01:34:28 | PA
 a4 | b4 | c4 | 2017-05-12 19:16:15 | PA

dbfiddle here

Don't expect this kind of queries to be efficient in realistic scenarios.

Related Solutions

Mysql – Faster query when joining data from multiple tables

Leading wildcard in LIKE is the main problem. Another problem is that the WHERE clause is spread across 3 tables.

Consider using a FULLTEXT index; this will make looking for "words" much faster. Alas, you may be stuck with LIKE in the 20-char part number.

Still the structure of the query will need to be changed...

SELECT  p.id, p.manufacturerId, m.code as manufacturerCode, p.materialNumber,
        p.partNumber, p.materialDescription, c.title as manufacturerTitle
    FROM  ( 
              ( SELECT  id
                    FROM  craft__parts
                    WHERE  partNumber like '%polymer%' 
              )
            UNION  DISTINCT 
              ( SELECT  id
                    FROM  craft__parts
                    WHERE  materialDescription like '%polymer%' 
              )
            UNION  DISTINCT 
              ( SELECT  p3.id
                    FROM  `craft__parts` AS p3
                    JOIN  `craft__manufacturers` AS m3 ON p3.manufacturerId = m3.id
                    JOIN  `craft_content` AS c3 ON m3.code = c3.field_manufacturerCode
                    WHERE  title like '%polymer%' 
              )
          ) AS p
    JOIN  `craft__manufacturers` AS m ON p.manufacturerId = m.id
    JOIN  `craft_content` AS c ON m.code = c.field_manufacturerCode
    ORDER BY  c.title asc

The goal there is to

Do each of the awful searches separately.
Combine the results (UNION DISTINCT ... p.id)
Then look up the rest of the fields

Step 1 still involves a full table scan (if using LIKE) or a quick FULLTEXT lookup (much better).

Steps 2 and should be relatively fast, assuming there were not too many matching rows.

The problem with the original query is all the bulky stuff hauled around while doing the JOINs.

Other notes...

A PRIMARY KEY is a KEY; do not redundantly say KEY(id).

Does m.code really need to be bigger than 255 characters? "Prefix" indexes are virtually useless, and may make the JOIN in the third UNION not run very fast.

Sql-server – Generate a column from different rows in sql

To address this problem, I did the following (I'm using PostgreSQL for this, but SQL Server has CTEs and RECURSIVE CTEs and window/analytic functions. I doubt if this can be done in MySQL - maybe using variables - not sure). PostgreSQL is as powerful as the big boys - MySQL is really a toy!

Created a table:

CREATE TABLE stock (sku int, datein date, dateout date, status int);

Populated it thus:

INSERT INTO stock VALUES (123, '2015-02-01', '2015-02-05', 1);
INSERT INTO stock VALUES (123, '2015-02-10', '2015-02-10', 1);
INSERT INTO stock VALUES (123, '2015-02-20', NULL,         0);

Then, I constructed the first part of my query. The actual dates to and from will obviously vary depending on the requirements - I assumed that (from the sample data), the OP wanted to look at the month of February. I constructed the following RECURSIVE CTE to construct a list of the days in February - PostgreSQL has the nifty generate_series function, but that's proprietary and I wanted to make my query portable (to the greatest extent possible). This is the best explanation of RCTEs I came across.

WITH RECURSIVE dates (test_date) AS
(
  SELECT '2015-02-01'::DATE
  UNION ALL
  SELECT test_date + 1 FROM dates
  WHERE test_date < '2015-02-28'
)

Now, I thought that the organisation of the raw data was poor, so I used CTEs to rearrange the data.

Get the outdates from the raw data:

out_dates AS
(
  SELECT dateout AS dout FROM stock
)

Get the indates from the raw data:

in_dates AS
(
  SELECT datein AS din FROM stock
)

Then I "transpose" the data - have to use two steps because one cannot use a Window function in a WHERE clause

transpose AS
(
  SELECT lag(dateout) over() AS datein, datein AS dateout FROM stock
  -- WHERE lag(datein) over() IS NOT NULL AND dateout IS NOT NULL <<-- can't use here.
  ORDER BY dateout, datein
)

And then

ranges AS
(
  SELECT * FROM transpose
  WHERE datein IS NOT NULL AND dateout IS NOT NULL
),

The result of this query is:

datein|dateout
2015-02-05|2015-02-10
2015-02-18|2015-02-20

Note that this corresponds to the in and out dates, but now the data is much simpler to handle because it's structure is more logical - it's chronologically coherent - thanks to the use of the LAG window function.

Now, we bring it all together with the next query:

x AS
(
  SELECT test_date, CASE WHEN has_match THEN 1 ELSE 0 END::int AS flag
  FROM 
  (
    SELECT test_date, 
    EXISTS
    (
      SELECT 1 
      FROM ranges 
      WHERE dates.test_date BETWEEN ranges.datein AND ranges.dateout
    ) AS has_match
  FROM dates  
  ) range_checks
)
SELECT * FROM x;

This gives a series of records for the month that are of this form:

test_date|flag
--------------
2015-02-01|0
2015-02-02|0
2015-02-03|0
2015-02-04|0
2015-02-05|1
2015-02-06|1
<other records snipped for brevity>

or use this as the final query

z AS 
(
  SELECT * FROM dates d
  LEFT JOIN ranges r
  ON (d.test_date BETWEEN r.datein AND r.dateout)
)
SELECT * FROM z;

for records of this form:

test_date|datein|dateout
2015-02-01||
2015-02-02||
2015-02-03||
2015-02-04||
2015-02-05|2015-02-05|2015-02-10
2015-02-06|2015-02-05|2015-02-10
<rest of the records snipped for brevity>

Best Answer

Related Solutions

Mysql – Faster query when joining data from multiple tables

Sql-server – Generate a column from different rows in sql

Related Question