Linux – Grep/awk/sed for lines composed of only two letters, and lines that start with a letter and meet a certain length

awk, grep, linux, sed, text processing

Not sure how many people are familiar with DNA sequencing data, but here is part of my file (lines starting with ">" are IDs, and lines starting with a letter are DNA sequences):

>NB501013:9:HJJ75BGXX:4:13609:24076:18015/2
GGGGGGGAAAAAAA
>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
>NB501013:9:HJJ75BGXX:4:22611:20567:13384/2
GAATA

I would want to remove the line:
GGGGGGGAAAAAAA

Along with its sequencing ID (I know you can do that using grep -B1). But does anyone know how to remove the lines that are composed of only two distinct letters?

Also, I would want to remove sequences that are 5 letters or shorter, along with their IDs. I can't simply grep for lines longer than a certain length, because all the IDs are pretty long. So I need to somehow match only the lines that start with a letter (that is, not with ">") and are at or below a certain length, and exclude them with grep -v.

Therefore, my sample output would be:

>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC

Best Answer

Give pcregrep, a grep that uses Perl-compatible regular expressions, a try:

  • to remove sequences made up of only two letters:

    pcregrep -Mv '>.*\n([ACGT])\1*([ACGT])\2*(\1|\2)*$' file
    

    output:

    >NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
    CTCGTCGCATCACAAAGGGAT
    >NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
    CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
    >NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
    CAGCCC
    >NB501013:9:HJJ75BGXX:4:22611:20567:13384/2
    GAATA
    
  • to remove sequences of 5 letters or fewer:

    pcregrep -Mv '>.*\n[ACGT]{1,5}$' file
    

    output:

    >NB501013:9:HJJ75BGXX:4:13609:24076:18015/2
    GGGGGGGAAAAAAA
    >NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
    CTCGTCGCATCACAAAGGGAT
    >NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
    CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
    >NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
    CAGCCC
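
If pcregrep is not available, both filters can probably be combined in a single awk pass. The following is only a sketch, not part of the original answer. It assumes every record is exactly one ID line followed by one sequence line, and that sequences use only the uppercase letters A, C, G and T. The function name filter_reads and the default minimum length of 6 are made up for illustration.

```shell
# filter_reads [MIN]: hypothetical helper, reads ID/sequence pairs on stdin.
# Keeps a pair only if the sequence has at least MIN letters (default 6)
# and contains more than two distinct letters out of A, C, G, T.
filter_reads() {
  awk -v min="${1:-6}" '
    /^>/ { id = $0; next }            # remember the most recent ID line
    {
      n = 0                           # number of distinct letters seen
      if (index($0, "A")) n++
      if (index($0, "C")) n++
      if (index($0, "G")) n++
      if (index($0, "T")) n++
      if (n > 2 && length($0) >= min) {
        print id                      # print the ID, then its sequence
        print $0
      }
    }
  '
}
```

It could be called as `filter_reads < file`, or with a different threshold as `filter_reads 10 < file`. On the sample from the question, it should keep only the three pairs listed in the desired output.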
    