Use another file to extract part of a line that matches with grep, as well as the following line, then save to new file

grep sed

I have a file that has a DNA sequence identifier in one line and the DNA sequence in the next line right below it. The DNA sequence is long but it is in one line.

File1.fasta:

>AB244308.1.1447 233_28379 1..292
—————————————————————————————————————————————————–GTGCCAG-C-C-G-C-CGC-GGTAATAC-GG-AGGAT-GCG-A-GCG-TTATC-CGG-ATTCATT-GG-GT-TTA–AAGGGTGCGCAGG-C-G-G-GCGT-A-T————————————AA—-G-T-C-A—————————————————–G-G-G–G–TG–A-AA-TG–C-C-AC-G-G—————————————————————————————————————————————CT-C-AA—————————————————————————————————————————————————————-C-C-G-T-G-G-A–A-C—-T-G–C-C—T–T—————————-T–GA-T-A—C—————————————————-T–G-T–AT–G-T-C———————————————————————————————————————————-T-T-G-A-G-T–T—–T-AG——TT-G-A———————A-G-T-G—GG-C—————————————————————————————————————————————GG–A–ATG————————————————————————————————————————————T-A-G-C-AT–GT-A-G-CG-GT–G————–A–A-A—————————————————————————————————TG-C-AT-AG–AG-A-TG——————————-C-T——A-C——A-G-A-AC-A-CC————————————————GA–T–A–GC-GAA-G–G-C—-A——–G–C-T-C-A—CTA———A–GT-T-A—————————————————————————————————————————————–A-G——–A-C-T–GA–CG—–C———————————————TC–A-TG–C-A-CG-A–AA-G-C—-G-TG–GG-G-AT-C-A-AA-CA–GG-AT——–TA-G-ATA——–CC-C-C-C-GTA–GT-C-C——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————-

There are about 112,000 sequences in this file that follow that format. I have about 20 sequence identifiers that I'd like to pull from the fasta file and save to another file.

The sequence identifiers are in a txt file like this:

File2.txt:

AB244308.1.1447
New.ReferenceOTU151 
New.CleanUp.ReferenceOTU19 
New.ReferenceOTU59
New.CleanUp.ReferenceOTU6

In addition to pulling lines with the sequence identifiers, I'd like to pull the following line with the DNA sequence as well and print all of this to a new text file.

I've found through this answer (How to extract lines from a text file that contains strings from a list in another file?) that I would need to use grep and sed. I have also found another answer (https://stackoverflow.com/questions/7103531/how-to-get-the-part-of-file-after-the-line-that-matches-grep-expression-first) relevant to getting the line after the grep match.

Unfortunately, I am unsure how to proceed in combining these answers to get what I want.

Best Answer

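As they say, there's more than one way to skin a cat. With grep:

grep -F -f File2.txt -A 1 File1.fasta > File3.log

Note that with -A, GNU grep prints a -- separator line between non-adjacent groups of matches; adding --no-group-separator suppresses it.

Or with sed and ed:

< File2.txt sed -e 's|[.]|\\&|g; s|.*|g/^>&/.,.+1W File3.log|' | ed -s - File1.fasta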

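Here sed turns each sequence identifier into an ed command, generating an ed batch script dynamically: the first substitution escapes the dots so they match literally, and the second wraps the identifier in a global command that appends (W) the matching header line and the line after it to File3.log. The script is then piped to ed, which runs it against File1.fasta. Because W appends, File3.log should be empty or absent before you start.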
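
To see what ed actually receives, you can run just the sed stage by itself. This is only a sketch: it feeds two of the identifiers from File2.txt on standard input instead of reading the file, and the `script` variable is a name made up for the illustration.

```shell
# Two of the identifiers from File2.txt, fed in on standard input.
script=$(printf '%s\n' 'AB244308.1.1447' 'New.ReferenceOTU59' |
  sed -e 's|[.]|\\&|g; s|.*|g/^>&/.,.+1W File3.log|')

# Show the ed commands that would be piped to ed.
printf '%s\n' "$script"
# g/^>AB244308\.1\.1447/.,.+1W File3.log
# g/^>New\.ReferenceOTU59/.,.+1W File3.log
```

Each generated line is one ed global command: find header lines starting with that identifier, then append the header and the following line to File3.log.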

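Both commands match an identifier as a prefix or substring, so an ID such as New.ReferenceOTU59 could also pull in a header like New.ReferenceOTU590 if one exists. If that is a concern, awk can compare the whole identifier instead. This is a different technique from the two above, shown only as a sketch on made-up data; the sequence names seqA, seqA2 and seqC are invented for the test, and it assumes the identifier is the first whitespace-delimited word on each header line.

```shell
# Scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Made-up stand-ins for File1.fasta and File2.txt.
printf '%s\n' '>seqA 1..4' 'ACGT' '>seqA2 1..4' 'GGCC' '>seqC 1..4' 'TTAA' > File1.fasta
printf '%s\n' 'seqA' 'seqC ' > File2.txt

awk '
  NR == FNR { if (NF) ids[$1]; next }          # first file: store each identifier
  /^>/      { keep = (substr($1, 2) in ids) }  # header line: exact ID match?
  keep                                         # print header and its sequence
' File2.txt File1.fasta > File3.log

cat File3.log
```

Here seqA2 is skipped even though it contains seqA, and the trailing space after seqC in the ID list is ignored because awk splits on whitespace.
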
Related Solutions

Awk Grep – Count Number of Substring Repetitions in a String

Example 1

Consider this test file:

$ cat testfile
xxATGxxATG

ATGxxxATGxxx

xxATGxxxxATGxxATGxx

The following awk command counts the occurrences of ATG on each non-empty line:

$ awk -F'ATG' 'NF{print NF-1}' testfile
2
2
3

Example 2

Using the example in the current version of the question:

$ cat >file1
ATGTGGATGGTGGGTTACAATGAAGGTGGTGAGTTCAACATGGCTGATTATCCATTCAGTGGAAGGAAACTAAGGCCTCTCATTCCAAGACCAGTCCCAGTCCCTACTACTTCTCCTAACAGCACTTCAACTATAACTCCTTCCTTAAACCGCATTCATGGTGGCAATGATTTATTTTCACAATATCATCACAATCTGCAGCAGCAAGCATCAGTAGGAGATCATAGCAAGAGATCAGAGTTGAATAATAATAATAATCCATCTGCAGCAGTTGTGGTGAGTTCAAGATGGAATCCAACACCAGAACAGTTAAGAGCACTGGAAGAATTGTATAGAAGAGGAACAAGAACACCTTCTGCTGAGCAAATCCAACAAATAACTGCCCAGCTTAGAAAATTTGGAAAAATTGAAGGCAAAAATGTTTTCTATTGGTTTCAGAATCACAAAGCCAGAGAAAGGCAAAAACGACGGCGTCAAATGGAATCAGCAGCTGCTGAGTTTGATTCTGCTATTGAAAAGAAAGACTTAGGCGCAAGTAGG


ACAGTGTTTGAAGTTGAACACACTAAAAACTGGCTACCATCTACAAATTCCAGTACCAGTACTCTTCATCTTGCAGAGGAATCTGTTTCAATTCAAAGGTCAGCAGCAGCAAAAGCAGATGGATGGCTCCAATTCGATGAAGCAGAATTACAGCAAAGAAGAAACTTTATGGAAAGGAATGCCACGTGGCATATGATGCAGTTAACTTCTTCTTGTCCTACAGCTAGCATGTCCACCACAACCACAGTAACAACTAGACTTATGGACCCAAAACTCATCAAGACCCATGAACTCAACTTATTCATTTCACCTCACACATACAAAGAAAGAGAAAACGCTTTTATCCACTTAAATACTAGTAGTACTCATCAAAATGAATCTGATCAAACCCTTCAACTTTTCCCAATAAGGAATGGAGATCATGGATGCACTGATCATCATCATCATCATCATAACATTATCAAAGAGACACAGATATCAGCTTCAGCAATCAATGCACCCAACCAGTTTATTGAGTTTCTTCCCTTGAAAAACTGA

This results in:

$ awk -F'ATG' 'NF{print NF-1}' file1
9
15

How it works

awk implicitly loops through every line of a file. Each line is divided into fields.

-F'ATG'

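This tells awk to use ATG as the field separator.

NF{print NF-1}

For each non-empty line, this tells awk to print the number of fields minus 1. Since n occurrences of the separator split a line into n+1 fields, NF-1 is the number of ATG occurrences.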

(On empty lines, the number of fields, NF, is zero. So, the condition NF evaluates to false on these lines, effectively skipping over them.)
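
To make the relationship between NF and the count visible, here is a small sketch that prints both numbers side by side. It reuses two lines from Example 1 plus an empty line; the `sample` and `result` variables are just names chosen for the illustration.

```shell
# Three sample lines; the middle one is empty and should be skipped.
sample=$(printf '%s\n' 'xxATGxxATG' '' 'ATGxxxATGxxx')

# Print the field count next to the derived number of ATG occurrences.
result=$(printf '%s\n' "$sample" | awk -F'ATG' 'NF { print NF, NF - 1 }')
printf '%s\n' "$result"
# 3 2
# 3 2
```

Both non-empty lines split into three fields (some of them empty strings), so each reports two occurrences, and the empty line produces no output at all.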

Grep – Extract Lines Starting with a Sequence and Output to Another File

Try this with GNU sed:

sed -n '/^BIHAR/p' file > new_file

or with grep:

grep '^BIHAR' file > new_file

or with awk:

awk '/^BIHAR/' file > new_file
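
If you want to convince yourself that the three commands are interchangeable, a quick check on made-up data could look like this. The sample rows and the out_sed, out_grep and out_awk file names are invented for the test.

```shell
# Scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Made-up input: only the first and last rows start with BIHAR.
printf '%s\n' 'BIHAR,1' 'DELHI,2' 'xBIHAR,3' 'BIHAR,4' > file

sed -n '/^BIHAR/p' file > out_sed
grep '^BIHAR' file > out_grep
awk '/^BIHAR/' file > out_awk

# cmp is silent when the files are identical.
cmp out_sed out_grep && cmp out_sed out_awk && cat out_sed
```

The ^ anchor is what keeps the xBIHAR row out of the result in all three cases.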
