Bash – Remove nearly duplicate lines

bash, text-processing, uniq

I've got a knotty problem that I can't figure out how to solve.

I have a text file containing a few million lines of text. Basically I want to run uniq, but with a twist: If two lines are identical but for a :FOO suffix, drop the line that lacks the suffix. But only if the lines are otherwise identical. And only for :FOO, not any other possible suffix.
I do not want to drop a line such as /usr/bin/delta:FOO when the line above it isn't identical. For example, given:

red.7
green.2
green.2:FOO
blue.6
yellow.9:FOO

I want to delete green.2, because the line below is identical but with a suffix. All other lines should be retained unchanged.

[Edit: I forgot to mention, the file is already in sorted order.]

My thoughts so far:

  • Obviously uniq is the tool to do this.
  • You can make uniq ignore a prefix, but never a suffix. (This is extremely annoying!)
  • I thought perhaps you could pretend that : is a field separator, and get cut (together with paste) to flip the field order. But no, it is apparently impossible to force cut to output a blank line if no separator is present.
  • My next thought is to go through line by line and output a 1-character prefix depending on the presence or absence of the suffix… but I can't imagine scripting that as a Bash loop being performant.

Any hints?

I may end up just using a real programming language to fix this. It seems simple enough to fix in Bash, but I've already wasted quite a lot of time failing to get it to work…

Best Answer

How about joining adjacent pairs of lines, and then using a backreference to find the non-unique prefix?

$ sed '$!N; /\(.*\)\n\1:FOO/D; P;D' file
red.7
green.2:FOO
blue.6
yellow.9:FOO

Explanation:

  • $!N - if we are not already at the last line, append the next line to the pattern space, separated by a newline
  • /\(.*\)\n - match everything up to the newline (i.e. the first of each pair of lines) and save it into a capture group
  • \1:FOO now matches whatever was captured from the first line, followed by :FOO (\1 is a backreference to the first capture group)
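  • /\(.*\)\n\1:FOO/D - if the second line of the pair is the same as the first followed by :FOO, then Delete the first (D removes everything up to the first newline and restarts the cycle without reading new input)
  • P;D - otherwise Print the first line of the pair, then Delete it so that the second line starts the next cycle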
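
To try the steps above on the sample from the question, here is a minimal sketch. It assumes the suffixed line immediately follows its unsuffixed twin. The function name dedupe_foo and the ^ and $ anchors are my additions, not part of the command above: without anchors the pattern can also match a trailing part of the first line, or a second line that merely starts with the first plus :FOO (for example green.2:FOOD), so anchoring seems the safer choice if such lines can occur.

```shell
# Sample input from the question (already sorted), written to a temp file.
sample=$(mktemp)
printf '%s\n' red.7 green.2 green.2:FOO blue.6 yellow.9:FOO > "$sample"

# dedupe_foo is a hypothetical helper name. It runs the same command as
# above, with ^ and $ added (an assumption, not in the original) so that
# only a whole-line match causes the first line of the pair to be dropped.
dedupe_foo() {
  sed '$!N; /^\(.*\)\n\1:FOO$/D; P;D' "$1"
}

dedupe_foo "$sample"
rm -f "$sample"
```

On the sample this should print the same four lines as the unanchored command: red.7, green.2:FOO, blue.6 and yellow.9:FOO.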

Or, more neatly (thanks @don_crissti):

$ sed '$!N; /\(.*\)\n\1:FOO/!P;D' file

Because of N, the pattern space holds two consecutive lines (except in the final cycle, when only the last line is left), and sed Prints the first of them only if the second line isn't the same as the first one plus the suffix :FOO. Then D removes the first line from the pattern space and restarts the cycle.
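
If sed's pattern-space juggling is hard to read, the same "hold one line back" idea can be sketched in awk, which also avoids the slow Bash loop the question worries about. This is only an illustration under the same assumption (the :FOO line directly follows its twin); the function name dedupe_foo_awk and the variable prev are made up for the example.

```shell
# dedupe_foo_awk is a hypothetical helper name.
# Hold one line back in prev; print it only when the current line is not
# exactly "<held line>:FOO". The last held line is printed in END.
dedupe_foo_awk() {
  awk 'NR > 1 && $0 != (prev ":FOO") { print prev }
       { prev = $0 }
       END { if (NR) print prev }' "$1"
}
```

Usage would be dedupe_foo_awk file. The comparison is against the whole line, so a line such as green.2:FOOD does not cause green.2 to be dropped.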
