PostgreSQL – How to Remove Duplicate Records in a Join Table

duplicationpostgresql

I have a table that has a schema like this:

create_table "questions_tags", :id => false, :force => true do |t|
        t.integer "question_id"
        t.integer "tag_id"
      end

      add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
      add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"

I would like to remove records that are duplicates, i.e. they have both the same tag_id and question_id as another record.

What does the SQL look like for that?

Best Answer

In my experience (and as shown in many tests) NOT IN as demonstrated by @gsiems is rather slow and scales terribly. The inverse IN is typically faster (where you can reformulate that way, like in this case), but this query with EXISTS (doing exactly what you asked) should be much faster yet - with big tables by orders of magnitude:

DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

Deletes every row where another row with the same (tag_id, question_id) and a smaller ctid exists. (Effectively keeps the first instance according to the physical order of tuples.) Using ctid in the absence of a better alternative, your table does not seem to have a PK or any other unique (set of) column(s).

ctid is the internal tuple identifier present in every row and necessarily unique. Further reading:

Test

I ran a test case with this table matched to your question and 100k rows:

CREATE TABLE questions_tags(
  question_id integer NOT NULL
, tag_id      integer NOT NULL
);

INSERT INTO questions_tags (question_id, tag_id)
SELECT (random()* 100)::int, (random()* 100)::int
FROM   generate_series(1, 100000);

ANALYZE questions_tags;

Indexes do not help in this case.

Results

NOT IN
The SQLfiddle times out.
Tried the same locally but I canceled it, too, after several minutes.

EXISTS
Finishes in half a second in this SQLfiddle.

Alternatives

If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original and rename the survivor's table. Careful, this has implications if you have view or foreign keys (or other dependencies) defined on the original.

If you have dependencies and want to keep them, you could:

Drop all foreign keys and indexes - for performance.
SELECT survivors to a temporary table.
TRUNCATE the original.
Re-INSERT survivors.
Re-CREATE indexes and foreign keys. Views can just stay, they have no impact on performance. More here or here.

Related Solutions

DB2 – Remove All Records with Duplicates

Assuming your desired uniqueness constraint is (one,two), here is one way to do it:

DELETE FROM session.test WHERE (one, two) IN ( SELECT one, two FROM session.test GROUP BY one, two HAVING COUNT(*) > 1 )

Oracle 11g – Updating Duplicate Records with Different Data

Using the ranking ROW_NUMBER() function will work. First give row numbers to all rows in both tables, then join using these row numbers, then update:

with 
  oldt as
  ( select fileNo , folder, fileType,
           row_number() over (partition by fileNo, folder
                              order by fileType)
             as rn
    from oldtable
  ),

  newt as
  ( select fileNo , folder, fileType,
           row_number() over (partition by fileNo, folder
                              order by fileType)
             as rn
    from newtable
  ),

  upd as
  ( select 
        n.fileType,
        o.fileType as old_fileType
    from newt n
      join oldt o
      on  n.fileNo = o.fileNo
      and n.folder = o.folder
      and n.rn     = o.rn 
  ) 

update
    upd
set 
    fileType = old_fileType ;

SQLfiddle seems to be giving error for Oracle, so it has been tested in SQL Server only: SQLfiddle-test (but this syntax should be valid for Oracle, too).

Tested in Oracle, the above doesn't work, sadly. I think because statements that start with WITH can only be SELECT statements. Even if the query is rearranged (I tried several rewrites), Oracle throws various errors. The only way I managed to have it working is after adding another column in newtable and a unique constraint on it. Then the following works (nid is the added primary key column).

Tested in Oracle's Live SQL site:

update 
  ( with 
      oldt as
      ( select fileNo , folder, fileType,
               row_number() over (partition by fileNo, folder
                                  order by fileType)
                 as rn
        from oldtable
      ),
    newt as
      ( select fileNo , folder, nid,
               row_number() over (partition by fileNo, folder  
                                  order by fileType)
                 as rn
        from newtable
      ),
    upd as
      ( select 
            n.nid,
            o.fileType as old_fileType
        from newt n
          join oldt o
          on  n.fileNo = o.fileNo
          and n.folder = o.folder
          and n.rn     = o.rn 
      ) 
    select 
        up.fileType,
        ( select upd.old_fileType
          from upd 
          where upd.nid = up.nid 
        ) as old_fileType
    from newtable up
  ) x
set fileType = old_fileType ;