Shell Script – Remove Duplicate Values Not on Identical Lines

Tags: scripting, shell-script, sort, text-processing, uniq

I have a text file that lists file names, each with an associated number. Currently it looks like this:

RR0.out -1752.142111    
RR1.out -1752.141887    
RR2.out -1752.142111    
RR3.out -1752.140319    
RR4.out -1752.140564    
RR5.out -1752.138532    
RR6.out -1752.138493    
RR7.out -1752.138493    
RR8.out -1752.138532

I want to write a script that will remove rows that have duplicate second values, so that the output would be:

RR0.out -1752.142111    
RR1.out -1752.141887    
RR3.out -1752.140319    
RR4.out -1752.140564    
RR5.out -1752.138532    
RR6.out -1752.138493    
RR8.out -1752.138532

I have seen both sort -u and uniq used for this, but I cannot figure out how to remove lines that aren't exactly identical (which can be done with uniq but not sort) AND not adjacent to one another (which can be done with sort but not uniq).
Can anyone give me any suggestions?

So far, the code below does not give me what I want:

sort -t ' ' -k 2n file > file2  
uniq -f 1 file2 > file3

Best Answer

$ sort -uk2 file
RR6.out -1752.138493
RR8.out -1752.138532
RR5.out -1752.138532
RR3.out -1752.140319
RR4.out -1752.140564
RR1.out -1752.141887
RR0.out -1752.142111

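sort -u sorts the input and keeps only one line for each distinct key; -k2 makes the key start at the second column (it runs from there to the end of the line), so both the sorting and the uniquing are based on that column.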
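One detail worth noting: because a key given as -k2 runs to the end of the line, trailing blanks become part of the key. That may be why both RR5.out and RR8.out appear in the output above, since the sample lines in the question appear to end in trailing spaces except for the last one. A minimal sketch of the difference, using a two-line sample made up for illustration and LC_ALL=C so that keys are compared byte by byte:

```shell
# Hypothetical two-line sample: the first line ends in trailing spaces, the second does not.
sample=$(printf 'RR5.out -1752.138532    \nRR8.out -1752.138532\n')

# -k2: the key runs from field 2 to the end of the line, trailing blanks included.
count_k2=$(printf '%s\n' "$sample" | LC_ALL=C sort -u -k2 | wc -l)

# -k2,2: the key stops at the end of field 2, so trailing blanks are left out.
count_k22=$(printf '%s\n' "$sample" | LC_ALL=C sort -u -k2,2 | wc -l)

echo "lines kept with -k2:   $count_k2"
echo "lines kept with -k2,2: $count_k22"
```

With GNU sort, the first command should keep both lines and the second only one. Whether the stricter -k2,2 key is wanted depends on whether trailing whitespace in the data is meaningful.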

In order to reorder the output based on the filenames in column one you can pipe it back into sort:

$ sort -uk2 file | sort -k1
RR0.out -1752.142111
RR1.out -1752.141887
RR3.out -1752.140319
RR4.out -1752.140564
RR5.out -1752.138532
RR6.out -1752.138493
RR8.out -1752.138532
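
If the original line order should be kept without a second sort, a commonly used alternative (not part of the answer above) is to let awk remember which second-column values it has already seen. The function name below is made up for illustration. Note that awk splits fields on runs of blanks, so trailing spaces are ignored and RR8.out would be dropped as a duplicate of RR5.out, which differs from the expected output listed in the question.

```shell
# Hypothetical helper: keep the first line for each distinct value in column 2,
# preserving the input order. Reads the named files, or standard input if none are given.
dedupe_by_second_field() {
    # seen[$2]++ is 0 (false) the first time a value appears, so !seen[$2]++ is true
    # only for the first occurrence, and awk prints the line by default.
    awk '!seen[$2]++' "$@"
}

# Usage with a small made-up sample on standard input:
printf 'RR0.out -1752.142111\nRR1.out -1752.141887\nRR2.out -1752.142111\n' |
    dedupe_by_second_field
```

Because nothing is sorted, this also avoids reading the whole input before producing output, at the cost of keeping every distinct key in memory.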

Related Solutions

linux text-processing uniq – How to Remove Duplicate Lines in a Large Multi-GB Text File

Try using sort with the -o/--output=FILE option instead of redirecting the output. You might also try setting the buffer size with -S/--buffer-size=SIZE, and try -s/--stable. The man page covers all of these options.

The full command you can use that might work for what you're doing:

sort -us -o wordlist_unique.lst wordlist.lst

You might also want to read the following URL:

http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html

It explains sort more thoroughly than the man page.
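
As a rough sketch of how these options fit together, assuming GNU sort: the example below deduplicates a small made-up word list in place. The 64M buffer size and the temporary directory are arbitrary illustrative choices, not recommendations for any particular file size.

```shell
# Hypothetical word list with duplicates, created in a temporary directory.
workdir=$(mktemp -d)
printf 'pear\napple\npear\nfig\napple\n' > "$workdir/wordlist.lst"

# -u drops duplicate lines, -S caps the in-memory buffer (64M is an arbitrary value),
# -T puts temporary spill files in the given directory, and -o names the output file.
# With GNU sort, naming the input file as the -o target is safe because the input
# is read completely before the output file is opened.
sort -u -S 64M -T "$workdir" -o "$workdir/wordlist.lst" "$workdir/wordlist.lst"

cat "$workdir/wordlist.lst"
```

By contrast, redirecting with > onto the input file would truncate it before sort reads anything, which is one reason to prefer -o.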

Bash – Remove nearly duplicate lines

How about joining adjacent pairs of lines, and then using a backreference to find the non-unique prefix?

$ sed '$!N; /\(.*\)\n\1:FOO/D; P;D' file
red.7
green.2:FOO
blue.6
yellow.9:FOO

Explanation:

$!N - if we are not already at the last line, append the next line to the pattern space, separated by a newline
/\(.*\)\n - match everything up to the newline (i.e. the first of each pair of lines) and save it into a capture group
\1:FOO now matches whatever was captured from the first line, followed by :FOO (\1 is a backreference to the first capture group)
/\(.*\)\n\1:FOO/D - if the second line of each pair is the same as the first followed by :FOO, then Delete the first
Print and Delete the remaining line ready to start the next cycle

or neater (thanks @don_crissti)

 sed '$!N; /\(.*\)\n\1:FOO/!P;D' file
N means there are always two consecutive lines in the pattern space and sed Prints the first one of them only if the second line isn't the same as the first one plus the suffix :FOO. Then D removes the first line from the pattern space and restarts the cycle.
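
To try this out, here is a small self-contained sketch. The input file for this answer is not shown, so the sample lines below are an assumption, chosen so that the result matches the output listed above; the function names are made up for illustration.

```shell
# Hypothetical input: some lines are followed by the same text with a :FOO suffix.
sample_input() {
    printf '%s\n' red.7 green.2 green.2:FOO blue.6 yellow.9 yellow.9:FOO
}

# Print the first line of each pair unless the next line is the same text plus :FOO,
# then delete that first line and restart the cycle with the remaining line.
drop_unsuffixed() {
    sed '$!N; /\(.*\)\n\1:FOO/!P;D'
}

sample_input | drop_unsuffixed
```

Note that the pattern is not anchored, so in principle it can also match when only the tail of the first line is repeated at the start of the next one; anchoring with ^ and $ may be worth considering for less regular data.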
