How to merge duplicate rows in a single table

duplicationmerge

I have a table with duplicate barcodes (barcode) and carts (cart) and (Quantity_counted_IN) and (Quantity_Counted_out).

I want to merge all duplicate records that have the same barcode and cart number to show only one barcode for that cart number but sum(Quantity_counted_IN) and sum(Quantity_Counted_out) at the same time.

Example:

           cart   barcode         Quantity_Counted_In  Quantity_Counted_Out
 Row 1        1   610708542209                     -7                    20
 Row 2        1   610708542209                     -4                    16
 Row 3        2   610708542209                     -3                    17

I want it to read:

 Row 1        1   610708542209                    -11                    36 
 Row 2        2   610708542209                     -3                    17

How can I do this?

Best Answer

This is what's known as an aggregate query,

SELECT cart, barcode,
       SUM(counted_in) AS counted_in,
       SUM(counted_out) AS counted_out
FROM tbl
GROUP BY cart, barcode;

Using GROUP BY like this returns one record for every unique occurrence of (cart, barcode) and the totals of counted_in and counted_out for each.

Related Solutions

SQL Server – How to Remove Duplicate Records in a Table

The syntax is just slightly different:

DELETE q
FROM dbo.questions_tags AS q
WHERE EXISTS 
(
   SELECT 1
   FROM   dbo.questions_tags AS q1
   WHERE  q1.ctid < q.ctid
   AND    q.question_id = q1.question_id
   AND    q.tag_id = q1.tag_id
);

Personally, I prefer to use a CTE. Then I can easily swap in a SELECT to validate what I am about to delete, and easily change the where clause to validate what I am going to keep:

;WITH q AS 
(
  SELECT question_id, tag_id, ctid,
    rn = ROW_NUMBER() OVER (PARTITION BY question_id, tag_id ORDER BY ctid)
  FROM dbo.question_tags
)
--DELETE q
SELECT * FROM q 
WHERE rn > 1; -- to show keepers, change to = 1

I believe these semantics match yours, but please test.

Then, of course, add a proper key constraint before you let anybody insert any new nonsense into this table.

PostgreSQL – How to Merge Two Huge Tables Keeping Only Unique Rows

The basic problem can be solved with various simple queries. Considering all columns:

CREATE TABLE tbl3 AS
TABLE tbl1
UNION TABLE tbl2;

Given this additional information:

All columns except the id column should be considered for the unique check.

And:

I don't need to preserve the ID column.

Just drop the id column, then you can proceed with the simple query above.

I would import to temporary tables (much faster, less overhead) and only write the final result (tbl3) to a regular table - in one session because temporary tables are dropped automatically at the end of the session.

CREATE TEMP TABLE tbl1 ( <columns from above, without id> );
COPY tbl1 FROM '/path/to/file1';

CREATE TEMP TABLE tbl2 ( <columns from above, without id> );
COPY tbl2 FROM '/path/to/file2';

Alternatively, to preserve the input tables across sessions, you could use unlogged tables.

For best performance create and fill the target with CREATE TABLE AS and add the PK constraint in the same transaction:

BEGIN;

CREATE SEQUENCE tbl3_tbl3_id_seq;

CREATE TABLE tbl3 AS 
SELECT nextval('tbl3_tbl3_id_seq'::regclass)::int AS tbl3_id, *
FROM  (TABLE tbl1 UNION TABLE tbl2 ) sub;

ALTER TABLE tbl3
   ADD CONSTRAINT tbl3_pkey PRIMARY KEY(tbl3_id)
 , ALTER COLUMN tbl3_id SET DEFAULT nextval('tbl3_tbl3_id_seq'::regclass);

ALTER SEQUENCE tbl3_tbl3_id_seq OWNED BY tbl3.tbl3_id;    

COMMIT;

Replace all occurrences of "tbl3" with our desired table name.

Detailed explanation in this related answer:

What causes large INSERT to slow down and disk usage to explode?

I added a serial column (tbl3_id) as surrogate PK to the target table. Adding the actual PK constraint at the end (of the same session) is the fastest way.

Optimizing bulk update performance in PostgreSQL

Before you do it, test whether double precision is the best data type for all those columns. Chances are, some of them could be integer (cheaper for whole numbers) or must really be numeric (loss-less). If so, adapt your temp tables to begin with.

Best Answer

Related Solutions

SQL Server – How to Remove Duplicate Records in a Table

PostgreSQL – How to Merge Two Huge Tables Keeping Only Unique Rows

Related Question