I've got a knotty problem that I can't figure out how to solve.
I have a text file containing a few million lines of text. Basically I want to run uniq, but with a twist: if two lines are identical but for a :FOO suffix, drop the line that lacks the suffix. But only if the lines are otherwise identical. And only for :FOO, not any other possible suffix.
I do not want to drop /usr/bin/delta:FOO, because the line above isn't identical.
red.7
green.2
green.2:FOO
blue.6
yellow.9:FOO
I want to delete green.2, because the line below is identical but with a suffix. All other lines should be retained unchanged.
[Edit: I forgot to mention, the file is already in sorted order.]
My thoughts so far:
- Obviously uniq is the tool to do this.
- You can make uniq ignore a prefix, but never a suffix. (This is extremely annoying!)
- I thought perhaps you could pretend that : is a field separator, and get cut (together with paste) to flip the field order. But no, it is apparently impossible to force cut to output a blank line if no separator is present.
- My next thought is to go through line by line and output a 1-character prefix depending on the presence or absence of the suffix… but I can't imagine scripting that as a Bash loop being performant.
Any hints?
I may end up just using a real programming language to fix this. It seems simple enough to fix in Bash, but I've already wasted quite a lot of time failing to get it to work…
Best Answer
How about joining adjacent pairs of lines, and then using a backreference to find the non-unique prefix?
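Pieced together from the explanation that follows, the command would look something like the sketch below. It assumes a sed that accepts \n in a regex and semicolon-separated commands (GNU sed does), and it uses the sample data from the question. The function name dedup_foo and the temporary file are only there for illustration.

```shell
# Sample input from the question (already sorted), in a throwaway file.
sample=$(mktemp)
printf '%s\n' red.7 green.2 green.2:FOO blue.6 yellow.9:FOO > "$sample"

# dedup_foo is a hypothetical wrapper name, not part of sed.
# $!N                  append the next line unless we are on the last one
# /\(.*\)\n\1:FOO/D    second line is the first plus :FOO, so drop the first
# P; D                 otherwise print the first line and slide the window
dedup_foo() {
  sed '$!N; /\(.*\)\n\1:FOO/D; P; D' "$1"
}

# Prints red.7, green.2:FOO, blue.6 and yellow.9:FOO, one per line.
dedup_foo "$sample"
```

Note that the pattern is unanchored, so it would also fire when only the tail of the first line is repeated, or when the second line carries extra text after :FOO. If that matters for your data, /^\(.*\)\n\1:FOO$/ is the stricter form.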
Explanation:
- $!N - if we are not already at the last line, append the next line to the pattern space, separated by a newline
- /\(.*\)\n - match everything up to the newline (i.e. the first of each pair of lines) and save it into a capture group
- \1:FOO - now matches whatever was captured from the first line, followed by :FOO (\1 is a backreference to the first capture group)
- /\(.*\)\n\1:FOO/D - if the second line of each pair is the same as the first followed by :FOO, then D (delete) the first
- P; D - otherwise P (print) and D (delete) the first line of the pair, leaving the remaining line ready to start the next cycle

or neater (thanks @don_crissti)