How Compare two columns of Same SQL table to find duplicate records

duplication

I have one table with two columns:

StartDateTime
EndDateTime

What will be best query to find out which rows has same (StartDateTime, EndDateTime)?

Best Answer

You did not specify a particular RDBMS (SQL Server, Oracle, etc.), but here is a quick solution that would probably work on most.

Assuming your table has some kind of unique key (primary key - like ID) to determine the difference between one row and another row, you could use a correlated exists check.

SELECT a.*
FROM table1 a
WHERE EXISTS (
        SELECT *
        FROM table1
        WHERE StartDateTime = a.StartDateTime
            AND EndDateTime = a.EndDateTime
            AND Id <> a.Id
        )

Related Solutions

PostgreSQL – How to Remove Duplicate Records in a Join Table

In my experience (and as shown in many tests) NOT IN as demonstrated by @gsiems is rather slow and scales terribly. The inverse IN is typically faster (where you can reformulate that way, like in this case), but this query with EXISTS (doing exactly what you asked) should be much faster yet - with big tables by orders of magnitude:

DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

Deletes every row where another row with the same (tag_id, question_id) and a smaller ctid exists. (Effectively keeps the first instance according to the physical order of tuples.) Using ctid in the absence of a better alternative, your table does not seem to have a PK or any other unique (set of) column(s).

ctid is the internal tuple identifier present in every row and necessarily unique. Further reading:

Test

I ran a test case with this table matched to your question and 100k rows:

CREATE TABLE questions_tags(
  question_id integer NOT NULL
, tag_id      integer NOT NULL
);

INSERT INTO questions_tags (question_id, tag_id)
SELECT (random()* 100)::int, (random()* 100)::int
FROM   generate_series(1, 100000);

ANALYZE questions_tags;

Indexes do not help in this case.

Results

NOT IN
The SQLfiddle times out.
Tried the same locally but I canceled it, too, after several minutes.

EXISTS
Finishes in half a second in this SQLfiddle.

Alternatives

If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original and rename the survivor's table. Careful, this has implications if you have view or foreign keys (or other dependencies) defined on the original.

If you have dependencies and want to keep them, you could:

Drop all foreign keys and indexes - for performance.
SELECT survivors to a temporary table.
TRUNCATE the original.
Re-INSERT survivors.
Re-CREATE indexes and foreign keys. Views can just stay, they have no impact on performance. More here or here.

Mysql – How to find duplicate values in multiple columns over different rows

Not sure why the question popped up now, but if you are still interested in an answer something like:

select g1.* 
from games g1 
where exists ( 
    select 1 
    from games g2 
    where g1.gameid <> g2.gameid 
      and least(g1.hometeam,g1.awayteam) 
        = least(g2.hometeam,g2.awayteam) 
      and greatest(g1.hometeam,g1.awayteam) 
        = greatest(g2.hometeam,g2.awayteam) 
      and abs(datediff(g1.d, g2.d)) < 2
);

should give you what you need