How to remove a specific, duplicated line within a file

Tags: duplicate, text-processing, uniq

I'm looking for a way to remove one specific line from a bunch of files, but only if it occurs more than once in that file. Other lines should be kept, even if they are duplicates.

For example, a file like this where I would like to remove the duplicates of AAA

AAA
BBB
AAA
BBB
CCC

should become

AAA
BBB
BBB
CCC

I guess I should use sed but I have no idea how to write the command.

Best Answer

With GNU sed:

sed '0,/^AAA$/b;//d'

That is, let everything through (b branches to the end of the script, like a continue) from line 0 up to the first line matching /^AAA$/. Starting the range at 0 (that is, even before the first line) means the range can close on the very first line, so the range ends at the first AAA wherever it occurs. For all the remaining lines, delete every occurrence of AAA (an empty // pattern reuses the last regular expression).

GNU sed is needed for the 0 address (and for the ability to have other commands after the b in the same expression, though the latter could easily be worked around in other implementations by using two -e expressions).
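As a quick sanity check, here is the sed command run against the sample input from the question (a sketch using a pipeline; GNU sed assumed):

```shell
# Feed the question's sample lines into the GNU sed command.
printf '%s\n' AAA BBB AAA BBB CCC | sed '0,/^AAA$/b;//d'
# Output:
# AAA
# BBB
# BBB
# CCC
```

Since the question mentions a bunch of files, note that GNU sed's -i option edits files in place and processes each file separately (it implies -s), so the 0,/^AAA$/ range restarts for every file: sed -i '0,/^AAA$/b;//d' file1 file2 ...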

With awk:

awk '$0 != "AAA" || !n++'

(or for a regexp pattern: awk '!/^AAA$/ || !n++')

a shorthand for:

awk '! ($0 == "AAA" && n > 0) {print}; $0 == "AAA" {n++}'
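The same sanity check for the awk approach (a sketch using the question's sample input; any POSIX awk should work here):

```shell
# Print every line except repeated occurrences of AAA.
# n++ is only evaluated when $0 == "AAA" (short-circuit ||),
# so n counts AAA lines and only the first one passes the test.
printf '%s\n' AAA BBB AAA BBB CCC | awk '$0 != "AAA" || !n++'
# Output:
# AAA
# BBB
# BBB
# CCC
```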