I have a dataset that I am working with made up of phylip files that I have been editing. Phylip format is a bioinformatics format that contains as a header the number of samples and the sequence length, followed by each sample and its sequence. for example:
5 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatcgca
sample_4 caatatccga
sample_5 gaataagcga
My issue is that in trimming these datasets, the sample number in the header no longer is accurate (e.g. in above example might say five, but I've since trimmed to have only three samples). What I need to do is to replace that sample count with the new, accurate sample count but I cannot figure out how to do so without losing the sequence length number (e.g. the 10).
I have 550 files so simply doing this by hand is not an option. I can for-loop the wc but again I need to retain that sequence length information and somehow combine it with a new, accurate wc.
Best Answer
If I understand your requirement correctly you can use the following
awk
command:samples
will be set to the number of lines in theinput
file minus one (since you aren't counting the header line).awk
will then change the first column of the first line to the new sample number and print everything.With GNU awk you can use the
-i
flag to modify the files in place but I would prefer to make a second set of modified files to ensure the correct changes have been made.Something like: