How to keep only the lines that start with a given character, plus the line after each

awk grep sed text-processing

I have a FASTA file in which some sequences intentionally have a wrong header (i.e. the leading > is missing) and others have a good header. The file is well formatted in the sense that each nucleotide sequence is on a single line.

Example :

2865958
AACTACTACAG
>hCoV-19/2832832
ACTCGGGGGG
28328332
ATTCCCCG
>hCoV-19/2789877
ACTCGGCCC

I want to keep only the sequences with a correct header (i.e. a line that starts with >), like this:

>hCoV-19/2832832
ACTCGGGGGG
>hCoV-19/2789877
ACTCGGCCC

I've tried various approaches (sed, grep, awk), but none gave the right result:

awk '/^>/ { ok=index($0,"hCoV")!=0;} {if(ok) print;}' combined_v4.fa > combined_v5.fa

sed -n '/^>.*hCoV/,/^>/ {/^>.*hCoV/p ; /^>/! p}' combined_v4.fa > combined_v5.fa

grep -w ">" -A 1 combined_v4.fa > combined_v5.fa

Do you have any idea how to do this?

Best Answer

Tell grep to look for lines starting with >, and include the line following each match:

grep -A1 --no-group-separator '^>' combined_v4.fa > combined_v5.fa

In case your version of grep does not support --no-group-separator, try this:

grep -A1 '^>' combined_v4.fa | grep -v '^--$' > combined_v5.fa
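The same filtering can also be sketched in awk, which avoids the group separator altogether. This is an alternative to the grep commands above, not part of the original answer; the file name fasta_demo.fa and the variable names are made up for the demonstration, and the input is the example from the question.

```shell
# Sample input copied from the question; fasta_demo.fa is a made-up file name.
cat > fasta_demo.fa <<'EOF'
2865958
AACTACTACAG
>hCoV-19/2832832
ACTCGGGGGG
28328332
ATTCCCCG
>hCoV-19/2789877
ACTCGGCCC
EOF

# A header line sets the counter to 2. Every line is printed while the
# counter is positive, so each header and the line after it are kept.
kept=$(awk '/^>/ { n = 2 } n > 0 { print; n-- }' fasta_demo.fa)
printf '%s\n' "$kept"
```

Because the counter is reset on every header, two consecutive header lines would both be kept, which matches what grep -A1 does.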

Nested Braces

Let's take this as a test file with lots of nested braces:

a{b{c}d}e
1{2
}3{
}
5

Here is a modification to handle nested braces:

$ sed ':again;$!N;$!b again; :b; s/{[^{}]*}//g; t b' file2
ae
13
5

Explanation:

:again;$!N;$!b again

This reads the whole file into the pattern space.

:b

This defines a label b.

s/{[^{}]*}//g

This removes text in braces as long as the text contains no inner braces.

t b

If the above substitute command resulted in a change, jump back to label b. In this way, the substitute command is repeated until all brace groups are removed.
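To see why the loop matters, here is a small sketch that runs the substitution once and then with the t b loop on the first line of the test file. The variable names are made up for the demonstration, and the script is split into separate -e arguments on the assumption that this is more portable than a label followed by a semicolon.

```shell
# One pass only: the innermost group {c} is removed, the outer group remains.
one_pass=$(printf '%s\n' 'a{b{c}d}e' | sed -e 's/{[^{}]*}//g')

# With the loop: repeat the substitution until it no longer changes anything.
looped=$(printf '%s\n' 'a{b{c}d}e' | sed -e ':b' -e 's/{[^{}]*}//g' -e 't b')

printf '%s\n%s\n' "$one_pass" "$looped"
```

The first result still contains {bd}, because that group only becomes free of inner braces after {c} has been removed; the second pass of the loop then deletes it.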

awk – Multiline Regexp with Grep, Sed, Awk, and Perl

You can do this with Awk by setting the "Record Separator" variable to be a regex matching at least two consecutive newline characters:

awk -v RS='\n\n+' '/1.*2.*3/' file.txt

You can also set the "Field Separator" to be a single newline character:

awk -v RS='\n\n+' -F '\n' '$1 == "LINE OF TEXT 1" && $2 == "LINE OF TEXT 2" && $3 == "LINE OF TEXT 3"' file.txt

Broken up for readability:

awk -v RS='\n\n+' -F '\n' '
  $1 == "LINE OF TEXT 1" &&
  $2 == "LINE OF TEXT 2" &&
  $3 == "LINE OF TEXT 3"
' file.txt
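POSIX leaves the behaviour of a multi-character RS unspecified, so the regex form above depends on the awk implementation (GNU awk accepts it). A minimal sketch of a portable variant uses paragraph mode instead: an empty RS splits records on blank lines. The file name paragraphs_demo.txt and its contents are made up for the demonstration.

```shell
# Two blank-line-separated blocks; only the first one should match.
printf 'LINE OF TEXT 1\nLINE OF TEXT 2\nLINE OF TEXT 3\n\nsome\nother block\n' > paragraphs_demo.txt

# RS set to the empty string enables paragraph mode: records are separated
# by blank lines, and each line of a record becomes one field.
hits=$(awk -v RS= -F '\n' '
  $1 == "LINE OF TEXT 1" &&
  $2 == "LINE OF TEXT 2" &&
  $3 == "LINE OF TEXT 3" { print "record " NR " matches" }
' paragraphs_demo.txt)
printf '%s\n' "$hits"
```

In paragraph mode a newline always acts as a field separator, so the -F option is kept here mainly to stop spaces inside a line from splitting it into several fields.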

With your requirement of only printing the filename if a match is found, you can do this like so:

awk -v RS='\n\n+' -F '\n' '
  $1 == "LINE OF TEXT 1" &&
  $2 == "LINE OF TEXT 2" &&
  $3 == "LINE OF TEXT 3" {
    found++   # "match" is a built-in awk function, so use another name
  }
  END {
    if (found) {
      print FILENAME
    }
  }
' file.txt

But considering you are talking about using find in combination with awk, I'd recommend just using Awk for the exit status and using find for the printing:

find . -type f -exec awk -v RS='\n\n+' -F '\n' '
  $1 ~ /LINE OF TEXT 1/ &&
  $2 ~ /LINE OF TEXT 2/ &&
  $3 ~ /LINE OF TEXT 3/ {
    found = 1
    exit
  }
  # exit in the rule above still runs END, so the status is decided here
  END { exit !found }
' {} \; -print

That way, if you want to do something else before printing (some other find primary), you're already set up to do so.
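The following sketch checks the find and awk combination against two throwaway files. The directory and file names are made up for the demonstration, and it uses paragraph mode (an empty RS) instead of the regex separator, on the assumption that this works in more awk implementations.

```shell
# demo_dir and the file names below are made up for this example.
demo_dir=$(mktemp -d)
printf 'LINE OF TEXT 1\nLINE OF TEXT 2\nLINE OF TEXT 3\n' > "$demo_dir/hit.txt"
printf 'LINE OF TEXT 1\nsomething else\n' > "$demo_dir/miss.txt"

# awk exits 0 only when a record matched. find evaluates -print only after
# -exec reports success, so only matching files are listed.
matches=$(find "$demo_dir" -type f -exec awk -v RS= -F '\n' '
  $1 ~ /LINE OF TEXT 1/ &&
  $2 ~ /LINE OF TEXT 2/ &&
  $3 ~ /LINE OF TEXT 3/ {
    found = 1
    exit
  }
  END { exit !found }
' {} \; -print)
printf '%s\n' "$matches"
rm -r "$demo_dir"
```

The exit status is set in the END block because a plain exit in the matching rule still runs END; a flag carries the result across.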
