Bash – Remove nearly duplicate lines

bash text-processing uniq

I've got a knotty problem that I can't figure out how to solve.

I have a text file containing a few million lines of text. Basically I want to run uniq, but with a twist: If two lines are identical but for a :FOO suffix, drop the line that lacks the suffix. But only if the lines are otherwise identical. And only for :FOO, not any other possible suffix.
I do not want to drop /usr/bin/delta:FOO, because the line above isn't identical.

red.7
green.2
green.2:FOO
blue.6
yellow.9:FOO

I want to delete green.2, because the line below is identical but with a suffix. All other lines should be retained unchanged.

[Edit: I forgot to mention, the file is already in sorted order.]

My thoughts so far:

Obviously uniq is the tool to do this.
You can make uniq ignore a prefix, but never a suffix. (This is extremely annoying!)
I thought perhaps you could pretend that : is a field separator, and get cut (together with paste) to flip the field order. But no, it is apparently impossible to force cut to output a blank line if no separator is present.
My next thought is to go through line by line and output a 1-character prefix depending on the presence or absence of the suffix… but I can't imagine scripting that as a Bash loop being performant.

Any hints?

I may end up just using a real programming language to fix this. It seems simple enough to fix in Bash, but I've already wasted quite a lot of time failing to get it to work…

Best Answer

How about joining adjacent pairs of lines, and then using a backreference to find the non-unique prefix?

$ sed '$!N; /\(.*\)\n\1:FOO/D; P;D' file
red.7
green.2:FOO
blue.6
yellow.9:FOO

Explanation:

$!N - if we are not already at the last line, append the next line to the pattern space, separated by a newline
/\(.*\)\n - match everything up to the newline (i.e. the first of each pair of lines) and save it into a capture group
\1:FOO - now matches whatever was captured from the first line, followed by :FOO (\1 is a backreference to the first capture group)
/\(.*\)\n\1:FOO/D - if the second line of each pair is the same as the first followed by :FOO, then Delete the first
P;D - otherwise, Print the first line of the pair and Delete it, ready to start the next cycle
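
To check the command against the sample from the question, something like the following should do. It is only a sketch: the temporary file stands in for the real input, and it assumes a sed that accepts \n inside a regex (GNU sed does).

```shell
# Recreate the five sample lines from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' red.7 green.2 green.2:FOO blue.6 yellow.9:FOO > "$sample"

# Same command as above; capture the output so it can be inspected.
result=$(sed '$!N; /\(.*\)\n\1:FOO/D; P;D' "$sample")
printf '%s\n' "$result"   # green.2 is gone, the other four lines remain

rm -f "$sample"
```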

or neater (thanks @don_crissti)

sed '$!N; /\(.*\)\n\1:FOO/!P;D' file

N means there are always two consecutive lines in the pattern space and sed Prints the first one of them only if the second line isn't the same as the first one plus the suffix :FOO. Then D removes the first line from the pattern space and restarts the cycle.
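
One caveat, offered as a suggestion rather than as part of the original answer: the pattern is not anchored, so a line followed by the same text plus a longer suffix such as :FOOD also counts as a match. If that matters for your data, anchoring the pattern with ^ and $ should restrict it to exactly :FOO. The file contents below are made up for illustration.

```shell
# Hypothetical sorted input: only "beta" has an exact :FOO twin.
input=$(mktemp)
printf '%s\n' alpha alpha:BAR beta beta:FOO delta:FOO gamma gamma:FOOD > "$input"

# Unanchored pattern from the answer: "gamma" is dropped as well, because
# the following line "gamma:FOOD" begins with "gamma:FOO".
loose=$(sed '$!N; /\(.*\)\n\1:FOO/!P;D' "$input")

# Anchored variant (an assumption about what is wanted, not the answer's
# command): ^ and $ force the second line to be exactly the first plus :FOO.
strict=$(sed '$!N; /^\(.*\)\n\1:FOO$/!P;D' "$input")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -f "$input"
```

With this input the anchored form keeps gamma, while both forms drop beta and leave alpha:BAR and delta:FOO alone.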

Related Solutions

linux text-processing uniq – How to Remove Duplicate Lines in a Large Multi-GB Text File

Try using sort with the -o/--output=FILE option instead of redirecting the output. You might also try setting the buffer size with -S/--buffer-size=SIZE, and adding -s/--stable. The man page documents all of these options.

The full command you can use that might work for what you're doing:

sort -us -o wordlist_unique.lst wordlist.lst
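
As a rough illustration of those options on a tiny file, assuming GNU sort: the temporary file and the 64M buffer size are placeholders, and LC_ALL=C is an added assumption to get plain byte ordering.

```shell
# Tiny stand-in for the real word list; the path is a placeholder.
list=$(mktemp)
printf '%s\n' pear apple pear banana apple > "$list"

# -u drops duplicates, -s keeps the sort stable, -S sets the memory buffer,
# and -o names the output file. sort reads all input before opening the
# output file, so -o may name the input file itself.
LC_ALL=C sort -us -S 64M -o "$list" "$list"

unique=$(cat "$list")
printf '%s\n' "$unique"
rm -f "$list"
```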

You might also want to read the following URL:

http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html

It explains sort more thoroughly than the man page does.

Linux Bash – How to Select and Sort IP Address Keeping the Whole Line

This script uses awk to copy the IP address from field 3 to the start of the line, followed by a "%" separator, then sorts on the IP address now in the first field, then removes the added prefix.

awk '{print $3 " % " $0}' |
sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 |
sed 's/[^%]*% //'
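
A small worked run of that pipeline might look like this. The log lines and the function name sort_by_ip are invented for the example; only the three pipeline stages come from the answer.

```shell
# Hypothetical wrapper around the pipeline above.
sort_by_ip() {
  # 1. prepend the IP address (field 3) and a % marker to each line
  # 2. sort numerically on the four dot-separated octets
  # 3. strip the prepended key again
  awk '{print $3 " % " $0}' |
  sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 |
  sed 's/[^%]*% //'
}

# Made-up input with the IP address in field 3.
sorted=$(printf '%s\n' \
  'host1 up 10.0.0.20 web' \
  'host2 up 10.0.0.3 db' \
  'host3 down 9.255.1.1 cache' | sort_by_ip)

printf '%s\n' "$sorted"
```

Note that 10.0.0.3 sorts before 10.0.0.20, which a plain text sort would get wrong.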

If the IP address is not always in the same field, you can auto-detect it on each line. Replace the awk above with:

awk '{ for(i=1;i<=NF;i++)
         if($i~/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)break;
       print $i " % " $0
     }' |
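
Put together with the sort and sed stages, the auto-detecting version could be tried like this. Again a sketch: the function name sort_by_detected_ip and the input lines are hypothetical.

```shell
# Hypothetical wrapper: find the first field that looks like an IPv4
# address, use it as the sort key, then strip the key again.
sort_by_detected_ip() {
  awk '{ for(i=1;i<=NF;i++)
           if($i~/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)break;
         print $i " % " $0
       }' |
  sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 |
  sed 's/[^%]*% //'
}

# Made-up input with the IP address in a different field on each line.
detected=$(printf '%s\n' \
  'web 10.0.0.20 up' \
  '192.168.1.5 db down' \
  'cache ok 10.0.0.3' | sort_by_detected_ip)

printf '%s\n' "$detected"
```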
