Shell – Wrapping a loop around a ‘sed’-command processing many files in a single directory

bioinformatics, command line, sed, shell-script

I have text files containing many lines, some of which start with ">" (they are so-called *.fasta files, and each ">" marks the beginning of a new information container):

>header_name1
sequence_info
>header_name2
sequence_info

I want to tag each header with a label derived from the name of the file it is in. For example, if the file is named "1_nc.fasta", every line in that file starting with ">" should get the label "001" prepended:

>001-header_name1
sequence_info
>001-header_name2
sequence_info

Someone nice provided me with this line:

sed 's/^>/>001-/g' 1_nc.fasta > 001_tagged.fasta

Accordingly, all headers in 2_nc.fasta should start with "002-", 3_nc.fasta -> "003-", and so on.

I know how to write parallel job scripts, but each job finishes so quickly that a script processing all files serially in a loop seems much better. Unfortunately, I can't write one on my own.

Added twist: 11_nc.fasta and 149_nc.fasta are not available.

How can I loop that command over all 500 files in my directory?

Best Answer

This should do the trick. I break the filename at the underscore to get the numerical prefix, and then use printf to zero-pad it to a three-digit string.

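for file in *_nc.fasta; do
    # Text before the first underscore is the file number, e.g. "1" from "1_nc.fasta".
    prefix="$(printf "%03d" "${file%%_*}")"
    # Prepend the zero-padded number to every header line and write a new file.
    sed "s/^>/>$prefix-/" "$file" > "${prefix}_tagged.fasta"
done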
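
For reuse, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions: the function name tag_fasta_headers and its optional directory argument are made up for illustration, input files are assumed to follow the "<number>_nc.fasta" naming from the question, and leading zeros are stripped by hand because printf would otherwise try to read a number such as "08" as octal.

```shell
#!/bin/sh
# tag_fasta_headers is a hypothetical helper name, not part of the answer above.
# Usage: tag_fasta_headers [directory]   (defaults to the current directory)
tag_fasta_headers() (
    # The body runs in a subshell, so the variables below do not leak out.
    dir="${1:-.}"
    for file in "$dir"/*_nc.fasta; do
        # An unmatched glob stays literal; skip it instead of failing in sed.
        [ -e "$file" ] || continue
        base="${file##*/}"      # drop the directory part
        num="${base%%_*}"       # text before the first underscore, e.g. "1"
        # Skip names whose prefix is not purely numeric.
        case "$num" in
            ''|*[!0-9]*) continue ;;
        esac
        # Strip leading zeros so printf does not read "08" or "09" as invalid octal.
        num="${num#"${num%%[!0]*}"}"
        prefix="$(printf '%03d' "${num:-0}")"
        sed "s/^>/>$prefix-/" "$file" > "$dir/${prefix}_tagged.fasta"
    done
)
```

Because the loop iterates over whatever the glob matches, gaps in the numbering (such as the missing 11_nc.fasta and 149_nc.fasta) need no special handling: files that do not exist are simply never visited.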