Use another file to extract part of a line that matches with grep, as well as the following line, then save to new file

grep sed

I have a file that has a DNA sequence identifier in one line and the DNA sequence in the next line right below it. The DNA sequence is long but it is in one line.

File1.fasta:

>AB244308.1.1447 233_28379 1..292

—————————————————————————————————————————————————–GTGCCAG-C-C-G-C-CGC-GGTAATAC-GG-AGGAT-GCG-A-GCG-TTATC-CGG-ATTCATT-GG-GT-TTA–AAGGGTGCGCAGG-C-G-G-GCGT-A-T————————————AA—-G-T-C-A—————————————————–G-G-G–G–TG–A-AA-TG–C-C-AC-G-G—————————————————————————————————————————————CT-C-AA—————————————————————————————————————————————————————-C-C-G-T-G-G-A–A-C—-T-G–C-C—T–T—————————-T–GA-T-A—C—————————————————-T–G-T–AT–G-T-C———————————————————————————————————————————-T-T-G-A-G-T–T—–T-AG——TT-G-A———————A-G-T-G—GG-C—————————————————————————————————————————————GG–A–ATG————————————————————————————————————————————T-A-G-C-AT–GT-A-G-CG-GT–G————–A–A-A—————————————————————————————————TG-C-AT-AG–AG-A-TG——————————-C-T——A-C——A-G-A-AC-A-CC————————————————GA–T–A–GC-GAA-G–G-C—-A——–G–C-T-C-A—CTA———A–GT-T-A—————————————————————————————————————————————–A-G——–A-C-T–GA–CG—–C———————————————TC–A-TG–C-A-CG-A–AA-G-C—-G-TG–GG-G-AT-C-A-AA-CA–GG-AT——–TA-G-ATA——–CC-C-C-C-GTA–GT-C-C——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————-

There are about 112,000 sequences in this file that follow that format. I have about 20 sequence identifiers that I'd like to pull from the fasta file and save to another file.

The sequence identifiers are in a txt file like this:

File2.txt:

AB244308.1.1447
New.ReferenceOTU151 
New.CleanUp.ReferenceOTU19 
New.ReferenceOTU59
New.CleanUp.ReferenceOTU6

In addition to pulling lines with the sequence identifiers, I'd like to pull the following line with the DNA sequence as well and print all of this to a new text file.

I've found through this answer (How to extract lines from a text file that contains strings from a list in another file?) that I would need to use grep and sed. I have also found another answer (https://stackoverflow.com/questions/7103531/how-to-get-the-part-of-file-after-the-line-that-matches-grep-expression-first) relevant to getting the line after the grep match.

Unfortunately, I am unsure how to proceed in combining these answers to get what I want.

Best Answer

As they say, there's more than one way to skin a cat:

grep -F -f File2.txt -A 1 File1.fasta > File3.log
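One thing to watch with the grep version: when -A is used, GNU grep prints a line containing only -- between groups of matches that are not adjacent in the input, so File3.log may end up with separator lines mixed in with the records. The sketch below wraps the same grep call in a hypothetical helper, extract_pairs, and filters those separators out. It assumes no real sequence line consists of exactly two hyphens. With GNU grep, the --no-group-separator option should achieve the same thing without the second grep.

```shell
# extract_pairs IDS FASTA
# Hypothetical helper: same fixed-string match as above, plus the following
# line, with grep's "--" group-separator lines removed.
# Assumption: no sequence line in FASTA is exactly "--".
# Note: a blank line in IDS is an empty pattern and would match every line.
extract_pairs() {
    grep -F -f "$1" -A 1 "$2" | grep -v -x -e '--'
}
```

Usage would look like: extract_pairs File2.txt File1.fasta > File3.log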

< File2.txt sed -e 's|[.]|\\&|g; s|.*|g/^>&/.,.+1W File3.log|' | ed -s - File1.fasta

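The first command treats every line of File2.txt as a fixed string (-F -f) and prints each matching line of File1.fasta together with the line after it (-A 1). The second command uses sed to turn each sequence identifier into an ed command: it escapes the dots and wraps the identifier so that, for example, AB244308.1.1447 becomes g/^>AB244308\.1\.1447/.,.+1W File3.log. That generated batch script is piped to ed, which runs it against File1.fasta and appends each matching header line and the line after it to File3.log.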
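Both commands match the identifier as a prefix or substring, so an identifier such as New.ReferenceOTU6 could also pull in a record named New.ReferenceOTU60 if the file has one. If that matters for your data, here is a minimal awk sketch that compares the whole identifier instead. The function name extract_records is made up for illustration, and it assumes the identifier is the first whitespace-delimited field of each header line, as in the sample above.

```shell
# extract_records IDS FASTA
# Hypothetical helper: print each record whose header identifier (the first
# field, without the leading ">") exactly equals a line in IDS.
extract_records() {
    awk '
        # First file: store each identifier; $1 drops trailing whitespace.
        NR == FNR { if (NF) ids[$1] = 1; next }
        # Second file: on every header line, decide whether to keep the record.
        /^>/ { keep = (substr($1, 2) in ids) }
        # Print the header and the line(s) after it while keep is true.
        keep
    ' "$1" "$2"
}
```

Usage would look like: extract_records File2.txt File1.fasta > File3.log. Because the keep flag stays set until the next header, this also works if a sequence is wrapped over several lines.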
