How to remove duplicate lines that begin with a pattern and the next line after that

awk, sed, text processing

I want to remove duplicate lines that begin with > and the next line after that.

For example:

>1
ACCGGTTTCCTTGAAATT
>2
AACCTTCCGGTTAATT
>3
AACCTTCCGGTTAATT
>1
ACCGGTTTCCTTGAAATT

As you can see, two lines are duplicated:

AACCTTCCGGTTAATT and >1

However, I only want to remove the second >1 and the line after it, so I want an output like:

>1
ACCGGTTTCCTTGAAATT
>2
AACCTTCCGGTTAATT
>3
AACCTTCCGGTTAATT

If I use something like:

awk '!seen[$0]++'  filename

The output is:

>1
ACCGGTTTCCTTGAAATT
>2
AACCTTCCGGTTAATT
>3

This happens because it removes every duplicated line, whereas I only want to remove duplicated lines that begin with > together with the line that follows each of them.

My real file is several thousand lines long, so several different names after the > symbol could be repeated.

Any suggestions?

Best Answer

You can use getline in your awk to fetch the next line:

awk '/^>/{ if(!seen[$0]++){ print;getline;print } else { getline } }'
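One caveat with the getline form: if the input ends on a header line with nothing after it, getline fails, $0 stays unchanged, and that header is printed twice. The following is a minimal sketch that checks getline's return value before printing. The truncated sample input and the variable names input and result are made up for illustration.

```shell
# Hypothetical input whose last header (>2) has no sequence line after it.
input='>1
ACCGGTTTCCTTGAAATT
>2'

result=$(printf '%s\n' "$input" | awk '/^>/{
    if (!seen[$0]++) {
        print                       # first time this header is seen
        if ((getline) > 0) print    # print the next line only if one was read
    } else {
        getline                     # repeated header: consume and drop the next line
    }
}')

printf '%s\n' "$result"
```

With the unguarded command, the final >2 would appear twice in the output for this input.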

There is a simpler answer that also handles records whose sequence spans multiple lines:

awk '/^>/{ skip = seen[$0]++ }
     { if(!skip)print }'
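
As a rough check of the skip-flag command, the sketch below runs it on a made-up input in which the repeated record's sequence wraps over two lines. The sample data and the variable names input and result are assumptions for illustration, not taken from the question.

```shell
# Hypothetical sample: record >1 appears twice and its sequence spans two lines.
input='>1
ACCGGTTT
CCTTGAAATT
>2
AACCTTCCGGTTAATT
>1
ACCGGTTT
CCTTGAAATT'

# On each header line, skip is set to the number of times that header has
# already been seen (0 the first time). Every line, header or not, is
# printed only while skip is 0.
result=$(printf '%s\n' "$input" | awk '/^>/{ skip = seen[$0]++ }
     { if(!skip)print }')

printf '%s\n' "$result"
```

Because the flag stays set until the next header line, every line of a repeated record is dropped, however many lines its sequence occupies.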