Bash – Keeping First Instance of Duplicates

Tags: awk, bash, duplicate, shell-script, text processing

I have a file with multiple columns and, using a bash script, have identified lines where specific column values (cols 3-6) are duplicated.

Example input:

A B C D E F G
1 2 T TACA A 3 2 Q
3 4 I R 8 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q

I can display both instances of the repeated values. The other column values (cols 1, 2 and 7+) can differ between the two lines, hence the need for me to view both instances.

I want to save the unique records and the first instance of each duplicated record, after the duplicates have been sorted on col 5 (any order will do) and then on col 1 (descending order, largest value first).

Desired output:

A B C D E F G
1 2 T TACA A 3 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q

NB: The ordering on final output is not important as it will be resorted later. Making sure the desired rows are present is what matters.

My code so far is:

tot=$(awk 'n=x[$3,$6]{print n"\n"$0;} {x[$3,$6]=$0;}' oldfilename | wc -l)  #counts duplicated records and saves overall count as $tot
if [ $tot == "0" ] 
then
    awk '{print}' oldfilename >> newfilename  #if no dups found, all lines saved in new file
else if
    awk '(!(n=x[$3,$6]{print n"\n"$0;} {x[$3,$6]=$0;})' oldfilename >> newfilename  #if dups found, unique lines in old file saved in new file
else
    awk 'n=x[$3,$6]{print n"\n"$0;} {x[$3,$6]=$0;}' oldfilename > tempfile  #save dups in tempfile
    sort -k1,1, -k5,5 tempfile  #sort tempfile on cols 1 then 5 (want descending order)                  
fi

What I am unable to do is take the first instance of each duplicate and save it in newfilename. I also still have errors in the above code.

Please help.

Best Answer

sort itself should suffice. First, sort so that rows are grouped by the field range 3-6, with records within each group further ordered by fields 5 and 1. Pipe this to sort -u keyed on fields 3-6: the -u option disables the last-resort comparison and returns the first record from each 3-6 group. Finally, pipe the result to sort once more, this time by fields 5 and 1:

sort -k3,6 -k5,5r -k1,1r file | sort -k3,6 -u | sort -k5,5r -k1,1r
A B C D E F G
1 2 T TACA A 3 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q
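
As a possible alternative to the chained sort calls, a single sort followed by awk can pick the first record of each group. This is a minimal sketch and not part of the original answer; it assumes hypothetical 7-column data with no header line, and `seen` is just an illustrative array name. Because column 5 is part of the 3-6 key, all records in a duplicate group share the same column 5 value, so only the column 1 ordering decides which record comes first.

```shell
# Hypothetical sample data: 7 whitespace-separated columns, no header line.
sample='1 2 T A 3 2 Q
3 4 I R 8 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Z'

# Sort numerically on column 1, largest first, so the preferred record of
# each duplicate group is encountered first. Then keep only the first line
# seen for each combination of columns 3-6.
kept=$(printf '%s\n' "$sample" | sort -k1,1nr | awk '!seen[$3,$4,$5,$6]++')

printf '%s\n' "$kept"
```

With this sample, the row starting with 3 is dropped because the row starting with 8 has the same values in columns 3-6 and a larger column 1.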

Related Solutions

How to print incremental count of occurrences of unique values in column 1

The standard trick for this kind of problem in Awk is to use an associative counter array:

awk '{ print $0 "\t" ++count[$1] }'

This counts the number of times the first word in each line has been seen. It's not quite what you're asking for, since

Apple_1   1      300
Apple_2   1      500
Apple_1   500    1500

would produce

Apple_1   1      300     1
Apple_2   1      500     1
Apple_1   500    1500    2

(the count for Apple_1 isn't reset when we see Apple_2), but if the input is sorted you'll be OK.

Otherwise you'd need to track a counter and last-seen key:

awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 "\t" counter }'
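
To make the difference between the two commands concrete, the sketch below runs both on the unsorted three-line input shown above. The variable names `input`, `running` and `consecutive` are illustrative only.

```shell
# The unsorted example input from above, with single spaces between fields.
input='Apple_1 1 300
Apple_2 1 500
Apple_1 500 1500'

# Associative array: one running count per distinct key, never reset.
running=$(printf '%s\n' "$input" | awk '{ print $0 "\t" ++count[$1] }')

# Counter plus last-seen key: the count restarts whenever the key changes.
consecutive=$(printf '%s\n' "$input" |
  awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 "\t" counter }')

printf 'running:\n%s\n\nconsecutive:\n%s\n' "$running" "$consecutive"
```

On this input the first variant ends with a count of 2 for the second Apple_1 line, while the second variant reports 1 on every line because no key repeats on adjacent lines.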

Bash – Identifying duplicate fields and print both with awk

uniq is the correct tool for that:

uniq -D -f2 file

Where:

-D - prints all duplicates
-f2 - avoid comparing the first 2 fields
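
A small illustration of these options, assuming GNU uniq (the -D option is not in POSIX) and hypothetical 4-column data. Note that uniq only compares adjacent lines, so duplicates must already sit next to each other, for example after sorting.

```shell
# Hypothetical data: the first two lines agree on everything after field 2.
input='1 2 X Y
3 4 X Y
5 6 Z W'

# -f2 skips the first two fields when comparing; -D prints every line
# that belongs to a group of duplicates.
dups=$(printf '%s\n' "$input" | uniq -D -f2)

printf '%s\n' "$dups"
```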

Edit: If the fields 7 and above are not to be compared, you need awk:

awk 'n=x[$3,$4,$5,$6]{print n"\n"$0;} {x[$3,$4,$5,$6]=$0;}' file

The array item x[] (keyed on columns 3-6) is checked. If it is already set, the part in {...} runs (in the same statement, the variable n is set to the value of that array item).
In the braces {...}: the variable n (the previously stored line) and the current line $0 are printed.
Then the x[] array item is set to the current line's contents, for later comparison.
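
The walk-through above can be tried on a small input. This is only a sketch on hypothetical 7-column data; the assignment in the pattern is parenthesized here purely for readability.

```shell
# Hypothetical data: lines 2 and 4 share the same values in columns 3-6.
input='1 2 T A 3 2 Q
3 4 I R 8 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Z'

# When the 3-6 key was seen before, print the stored line and the current
# one; in every case remember the current line under that key.
pairs=$(printf '%s\n' "$input" |
  awk '(n = x[$3,$4,$5,$6]) { print n "\n" $0 } { x[$3,$4,$5,$6] = $0 }')

printf '%s\n' "$pairs"
```

One consequence of storing only the most recent line per key: if a key occurs three or more times, each middle occurrence is printed twice, once as the current line and once as the stored line.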
