How to keep only the lines that start with a given character, plus the line after each

awk, grep, sed, text processing

I have a FASTA file in which some sequences intentionally have a wrong header (i.e. the leading > is missing) and others have a correct header. The file is well formatted in the sense that each nucleotide sequence is on a single line.

Example:

2865958
AACTACTACAG
>hCoV-19/2832832
ACTCGGGGGG
28328332
ATTCCCCG
>hCoV-19/2789877
ACTCGGCCC

I want to keep only the sequences with a correct header (i.e. a line that starts with >), like this:

>hCoV-19/2832832
ACTCGGGGGG
>hCoV-19/2789877
ACTCGGCCC

I've tried various methods (sed, grep, awk), but none gave the expected result:

awk '/^>/ { ok=index($0,"hCoV")!=0;} {if(ok) print;}' combined_v4.fa > combined_v5.fa

sed -n '/^>.*hCoV/,/^>/ {/^>.*hCoV/p ; /^>/! p}' combined_v4.fa > combined_v5.fa

grep -w ">" -A 1 combined_v4.fa > combined_v5.fa

Do you have an idea how to do that?

Best Answer

Tell grep to look for lines starting with >, and to include the line following each match:

grep -A1 --no-group-separator '^>' combined_v4.fa > combined_v5.fa

In case your version of grep does not support --no-group-separator, try this:

grep -A1 '^>' combined_v4.fa | grep -v '^--$' > combined_v5.fa
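
If grep's context options are not available at all, the same idea can be expressed in awk. This is only a minimal sketch, not part of the original answer: it assumes, as stated in the question, that every sequence sits on exactly one line, and the file names sample.fa and filtered.fa are placeholders chosen for illustration.

```shell
# Recreate the sample input from the question in a scratch file
# (sample.fa is a hypothetical name used only for this demo).
printf '%s\n' \
  '2865958' 'AACTACTACAG' \
  '>hCoV-19/2832832' 'ACTCGGGGGG' \
  '28328332' 'ATTCCCCG' \
  '>hCoV-19/2789877' 'ACTCGGCCC' > sample.fa

# On every header line (one starting with ">"), arm a counter for two lines:
# the header itself and the single sequence line that follows it.
# Lines are printed only while the counter is positive.
awk '/^>/ { n = 2 } n > 0 { print; n-- }' sample.fa > filtered.fa

cat filtered.fa
```

Because the counter is reset on each header, two correct headers in a row would both be kept. No separator lines are produced, so no post-filtering is needed.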