I have a FASTA file in which some sequences intentionally have a wrong header (i.e. the leading > is missing) and some have a good header. The file is well-formatted in the sense that each nucleotide sequence is on a single line.
Example:
2865958
AACTACTACAG
>hCoV-19/2832832
ACTCGGGGGG
28328332
ATTCCCCG
>hCoV-19/2789877
ACTCGGCCC
I want to keep only the sequences with a correct header (i.e. a line that starts with >), like this:
>hCoV-19/2832832
ACTCGGGGGG
>hCoV-19/2789877
ACTCGGCCC
I've tried various methods (sed, grep, awk), but none gave a proper result:
awk '/^>/ { ok=index($0,"hCoV")!=0;} {if(ok) print;}' combined_v4.fa > combined_v5.fa
sed -n '/^>.*hCoV/,/^>/ {/^>.*hCoV/p ; /^>/! p}' combined_v4.fa > combined_v5.fa
grep -w ">" -A 1 combined_v4.fa > combined_v5.fa
Do you have an idea how to do that?
Best Answer
Tell grep to look for lines starting with >, and include the line following it:
In case your version of grep does not support --no-group-separator, try this:
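A minimal sketch of both variants, assuming GNU grep for the first one and one sequence line per header, as stated in the question. The file names example.fa, kept.fa, kept_fallback.fa and kept_awk.fa are hypothetical; the exact commands of the original answer are not shown here, so treat these as one plausible way to do it. The awk line is an extra portable alternative, not part of the answer.

```shell
# Recreate the example input from the question in a scratch file (hypothetical name).
printf '%s\n' \
  '2865958' 'AACTACTACAG' \
  '>hCoV-19/2832832' 'ACTCGGGGGG' \
  '28328332' 'ATTCCCCG' \
  '>hCoV-19/2789877' 'ACTCGGCCC' > example.fa

# Variant 1 (GNU grep): print every line starting with ">" plus the one line
# after it (-A 1), without the "--" separator grep puts between match groups.
grep -A 1 --no-group-separator '^>' example.fa > kept.fa

# Variant 2 (no --no-group-separator): keep the same matches, then drop
# the lines that consist only of the "--" separator.
grep -A 1 '^>' example.fa | grep -v '^--$' > kept_fallback.fa

# Alternative (assumption, not from the answer): awk prints each header
# and then reads and prints the single line that follows it.
awk '/^>/ { print; if ((getline seq) > 0) print seq }' example.fa > kept_awk.fa

cat kept.fa
```

Each command leaves only the header lines that start with > and the sequence line directly below them. Note that variant 2 would also discard a sequence line consisting solely of --, which should not occur in nucleotide data.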