Repairing the headers of phylip bioinformatics files to accurately reflect the updated number of samples in the file(s)

bioinformaticstext processingwc

I have a dataset that I am working with made up of phylip files that I have been editing. Phylip format is a bioinformatics format that contains as a header the number of samples and the sequence length, followed by each sample and its sequence. for example:

5 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatcgca
sample_4 caatatccga
sample_5 gaataagcga

My issue is that in trimming these datasets, the sample number in the header no longer is accurate (e.g. in above example might say five, but I've since trimmed to have only three samples). What I need to do is to replace that sample count with the new, accurate sample count but I cannot figure out how to do so without losing the sequence length number (e.g. the 10).

I have 550 files so simply doing this by hand is not an option. I can for-loop the wc but again I need to retain that sequence length information and somehow combine it with a new, accurate wc.

Best Answer

If I understand your requirement correctly you can use the following awk command:

awk -v samples="$(($(grep -c . input)-1))" 'NR == 1 { $1=samples }1' input

samples will be set to the number of lines in the input file minus one (since you aren't counting the header line).

awk will then change the first column of the first line to the new sample number and print everything.


$ cat input
5 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatccga
$ awk -v samples="$(($(grep -c . input)-1))" 'NR == 1 { $1=samples }1' input
3 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatccga

With GNU awk you can use the -i flag to modify the files in place but I would prefer to make a second set of modified files to ensure the correct changes have been made.

Something like:

for file in *.phy; do
    awk -v samples="$(($(grep -c . "$file")-1))" 'NR == 1 { $1=samples }1' "$file" > "${file}.new"
done
Related Question