How to remove duplicate lines that begin with a pattern and the next line after that

awk, sed, text-processing

I want to remove duplicate lines that begin with > and the next line after that.

For example:

>1
ACCGGTTTCCTTGAAATT
>2
AACCTTCCGGTTAATT
>3
AACCTTCCGGTTAATT
>1
ACCGGTTTCCTTGAAATT

As you can see, two lines are duplicated:

AACCTTCCGGTTAATT and >1

However, I only want to remove the duplicated >1 and the line after it, so I want output like:

>1
ACCGGTTTCCTTGAAATT
>2
AACCTTCCGGTTAATT
>3
AACCTTCCGGTTAATT

If I use something like:

awk '!seen[$0]++'  filename

The output is:

>1
ACCGGTTTCCTTGAAATT
>2
AACCTTCCGGTTAATT
>3

That happens because it removes every duplicated line, whereas I only want to remove duplicated lines that begin with >, together with the line after each of them.

My real file has several thousand lines, so several of the names after the > symbol could be repeated.

Any suggestions?

Best Answer

You can use getline in your awk script to fetch the line that follows each header:

awk '/^>/{ if(!seen[$0]++){ print;getline;print } else { getline } }' filename

There is a simpler version that also handles records whose sequence spans multiple lines:

awk '/^>/{ skip = seen[$0]++ }
     { if(!skip)print }' filename
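To see the second command handle multi-line records, here is a minimal sketch run against a made-up sample. The temporary file and its contents are assumptions for illustration only; record >1 appears twice and spans two sequence lines.

```shell
# Hypothetical sample: record ">1" is repeated and has a two-line sequence.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>1
ACCGGTTT
CCTTGAAATT
>2
AACCTTCCGGTTAATT
>3
AACCTTCCGGTTAATT
>1
ACCGGTTT
CCTTGAAATT
EOF

# On each header line, skip becomes the number of earlier sightings of that
# header (0 the first time). Every line is printed only while skip is 0.
result=$(awk '/^>/{ skip = seen[$0]++ }
     { if(!skip)print }' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The second >1 record is dropped together with both of its sequence lines, while the identical sequences under >2 and >3 are kept because their headers differ.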

Related Solutions

Remove duplicate lines while keeping the order of the lines

I doubt it will make a difference but, just in case, here's how to do the same thing in Perl:

perl -ne 'print if ++$k{$_}==1' out.txt

If the problem is keeping the unique lines in memory, that will have the same issue as the awk you tried. So, another approach could be:

cat -n out.txt | sort -k2 -k1n  | uniq -f1 | sort -nk1,1 | cut -f2-

How it works:

On a GNU system, cat -n will prepend the line number to each line following some amount of spaces and followed by a <tab> character. cat pipes this input representation to sort.
sort's -k2 option instructs it only to consider the characters from the second field until the end of the line when sorting, and sort splits fields by default on white-space (or cat's inserted spaces and <tab>).
When followed by -k1n, sort considers the 2nd field first, and then secondly—in the case of identical -k2 fields—it considers the 1st field but as sorted numerically. So repeated lines will be sorted together but in the order they appeared.
The results are piped to uniq, which is told to ignore the first field (-f1, with fields again separated by whitespace). This yields the unique lines of the original file, which are piped back to sort.
This time sort sorts on the first field (cat's inserted line number) numerically, getting the sort order back to what it was in the original file and pipes these results to cut.
Lastly, cut removes the line numbers that were inserted by cat. This is effected by cut printing only from the 2nd field through the end of the line (and cut's default delimiter is a <tab> character).

To illustrate:

$ cat file
bb
aa
bb
dd
cc
dd
aa
bb
cc
$ cat -n file | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-
bb
aa
dd
cc
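As a sanity check, the sketch below runs the full pipeline on the same nine-line example and compares it with the in-memory awk approach. The temporary file and the LC_ALL=C setting are assumptions added here, the latter to keep sort and uniq byte-oriented. It should behave the same with any cat -n that emits padded numbers followed by a tab.

```shell
# Recreate the nine-line example file in a temporary location.
file=$(mktemp)
printf '%s\n' bb aa bb dd cc dd aa bb cc > "$file"

# Number the lines, group identical lines (earliest first), keep the first of
# each group, restore the original order, then strip the numbers again.
via_sort=$(cat -n "$file" | LC_ALL=C sort -k2 -k1n | LC_ALL=C uniq -f1 | sort -nk1,1 | cut -f2-)

# In-memory equivalent, for comparison.
via_awk=$(awk '!seen[$0]++' "$file")

[ "$via_sort" = "$via_awk" ] && echo "both approaches agree"
printf '%s\n' "$via_sort"
rm -f "$file"
```

The trade-off is that sort can spill to temporary files on disk, whereas the awk array has to hold every distinct line in memory.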

Extract nth line matching pattern and the next N lines

Here's one way with awk:

awk -v N=85 -v M=5 '
/PATTERN/ { if (++c == N) { l = NR; last = NR + M } }
NR >= l && NR <= last' infile

Here N selects the Nth line matching PATTERN and M is the number of lines that follow it. The script keeps a counter, and when the Nth match is encountered it saves that line number in l and the last line to print in last. It then prints every line from l through last.
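A small run makes the counting easier to follow. This is only a sketch: the input lines, the /^x/ pattern standing in for PATTERN, and the values N=2 and M=2 are all made up for illustration.

```shell
# Hypothetical input: lines starting with "x" are the matches.
input=$(printf '%s\n' a x1 b c x2 d e f x3)

# N=2 selects the second match (x2); M=2 also prints the two lines after it.
result=$(printf '%s\n' "$input" | awk -v N=2 -v M=2 '
/^x/ { if (++c == N) { l = NR; last = NR + M } }
NR >= l && NR <= last')

printf '%s\n' "$result"
```

Until the Nth match is found, last is unset and compares as 0, so the NR <= last test fails and nothing is printed.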

For the record, here's how you do it with sed (GNU sed syntax):

sed -nE '/PATTERN/{x;/\n{84}/{x;$!N;$!N;$!N;$!N;$!N;p;q};s/.*/&\n/;x}' infile

This uses the hold space to count.
Each time it encounters a line matching PATTERN, it exchanges buffers and checks whether there are N-1 newline characters in the hold buffer. If the check succeeds, it exchanges again, pulls in the next M lines with the $!N command, prints the pattern space, and quits.
Otherwise it just adds another newline character to the hold space and exchanges back.
This solution is less convenient: it quickly becomes cumbersome when M is a big number and requires some printf-fu to build up the sed script (not to mention the pattern and hold space limits of some seds).
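One possible form of that printf-fu is sketched below, assuming GNU sed, the seq utility, and M of at least 1. The variable names, the /^x/ pattern standing in for PATTERN, and the sample input are made up for illustration.

```shell
# Assumed values: take the second match plus the two lines after it.
N=2
M=2

# Repeat the "$!N;" command M times. %.0s consumes one seq argument per
# repetition without printing it. (With M=0 this would still emit one "$!N;".)
pulls=$(printf '$!N;%.0s' $(seq "$M"))

# Assemble the script: N-1 newlines in the hold space mean this is the Nth match.
script="/^x/{x;/\\n{$((N - 1))}/{x;${pulls}p;q};s/.*/&\\n/;x}"

result=$(printf '%s\n' a x1 b c x2 d e f x3 | sed -nE "$script")
printf '%s\n' "$result"
```

With N=85 and M=5 the same construction should yield the script shown above, with /^x/ in place of PATTERN.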
