I have a file that has a DNA sequence identifier on one line and the DNA sequence on the line right below it. The DNA sequence is long, but it is all on one line.
File1.fasta:
>AB244308.1.1447 233_28379 1..292
—————————————————————————————————————————————————–GTGCCAG-C-C-G-C-CGC-GGTAATAC-GG-AGGAT-GCG-A-GCG-TTATC-CGG-ATTCATT-GG-GT-TTA–AAGGGTGCGCAGG-C-G-G-GCGT-A-T————————————AA—-G-T-C-A—————————————————–G-G-G–G–TG–A-AA-TG–C-C-AC-G-G—————————————————————————————————————————————CT-C-AA—————————————————————————————————————————————————————-C-C-G-T-G-G-A–A-C—-T-G–C-C—T–T—————————-T–GA-T-A—C—————————————————-T–G-T–AT–G-T-C———————————————————————————————————————————-T-T-G-A-G-T–T—–T-AG——TT-G-A———————A-G-T-G—GG-C—————————————————————————————————————————————GG–A–ATG————————————————————————————————————————————T-A-G-C-AT–GT-A-G-CG-GT–G————–A–A-A—————————————————————————————————TG-C-AT-AG–AG-A-TG——————————-C-T——A-C——A-G-A-AC-A-CC————————————————GA–T–A–GC-GAA-G–G-C—-A——–G–C-T-C-A—CTA———A–GT-T-A—————————————————————————————————————————————–A-G——–A-C-T–GA–CG—–C———————————————TC–A-TG–C-A-CG-A–AA-G-C—-G-TG–GG-G-AT-C-A-AA-CA–GG-AT——–TA-G-ATA——–CC-C-C-C-GTA–GT-C-C——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————-
There are about 112,000 sequences in this file, all following that format. I have about 20 sequence identifiers that I'd like to pull from the fasta file and save to another file.
The sequence identifiers are in a txt file like this:
File2.txt:
AB244308.1.1447
New.ReferenceOTU151
New.CleanUp.ReferenceOTU19
New.ReferenceOTU59
New.CleanUp.ReferenceOTU6
In addition to pulling lines with the sequence identifiers, I'd like to pull the following line with the DNA sequence as well and print all of this to a new text file.
I've found through this answer (How to extract lines from a text file that contains strings from a list in another file?) that I would need to use grep and sed. I have also found another answer (https://stackoverflow.com/questions/7103531/how-to-get-the-part-of-file-after-the-line-that-matches-grep-expression-first) relevant to getting the line after the grep match.
Unfortunately, I am unsure how to proceed in combining these answers to get what I want.
Best Answer
As they say, there's more than one way to skin a cat:
Here we turn the sequence identifiers into an ed batch script, generated dynamically. That script is then passed on to ed, which uses it to munge your fasta file and stores the results in File3.log.
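For comparison, here is a minimal sketch of the same extraction done with awk instead of ed. This is a different tool from the one described above, not a reconstruction of that script. The function name extract_seqs is made up for illustration, and the sketch assumes that File2.txt holds one identifier per line and that the identifier in each fasta header is the text between the leading ">" and the first whitespace.

```shell
#!/bin/sh
# extract_seqs IDS_FILE FASTA_FILE
# Hypothetical helper (the name is an assumption, not from the original answer).
# Prints every header line whose identifier is listed in IDS_FILE, together
# with the sequence line(s) that follow it, in the order they appear in the
# fasta file.
extract_seqs() {
    awk '
        # First file (the identifier list): remember each non-blank identifier.
        NR == FNR { if (NF) ids[$1]; next }
        # Header line: the identifier is the first field without the leading ">".
        /^>/ { keep = (substr($1, 2) in ids) }
        # Print the header and its sequence while inside a wanted record.
        keep
    ' "$1" "$2"
}

# Example call with the file names from the question:
#   extract_seqs File2.txt File1.fasta > File3.log
```

Because the identifier is compared as a whole string rather than as a pattern, New.ReferenceOTU59 will not also pull in a record named New.ReferenceOTU591, and the dots in the identifiers are not treated as wildcards.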