I'm in a directory with a couple thousand files, but the files I want to filter all match the following naming pattern: *.imputed.*_info
I want to use awk to pull out the records in each file where the 5th column of data has a value >= 0.50, and I was able to do that with:
awk '{if($5 >= .5) {print}}' filename
That worked. I then tried to loop through all 500 or so files and concatenate the records from each that match this criterion.
I tried the following but I am not getting the syntax right.
touch snplist.txt
for name in *.imputed.*_info; do
    snps="awk '{if($5 >= .5) {print}}' $name"
    cat snplist.txt "$snps" > snplist.txt
done
Best Answer
Your code overwrites the output file in each iteration. You also do not actually call awk.

What you want to do is something like:
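A minimal sketch of such a single awk call follows. The condition is the one from the question; the temporary directory and the two sample files (chr1.imputed.a_info and chr2.imputed.b_info) are made up here so that the command can be tried in isolation.

```shell
# Hypothetical sample data in a scratch directory (not part of the question)
workdir=$(mktemp -d)
cd "$workdir"
printf 'rs1 A G 0.1 0.62\nrs2 C T 0.2 0.31\n' > chr1.imputed.a_info
printf 'rs3 A C 0.3 0.50\n' > chr2.imputed.b_info

# One awk invocation over all matching files; a condition with no
# action block prints every line for which the condition is true.
# The ./ prefix guards against filenames that start with a dash.
awk '$5 >= 0.5' ./*.imputed.*_info > snplist.txt

cat snplist.txt
```

In a real directory, only the awk line is needed.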
This would call awk with all your files at once, and it would go through them one by one, in the order that the shell expands the globbing pattern. If the 5th column of any line in a file is greater than or equal to 0.5, that line would be output (into snplist.txt). This works since the default action, if no action ({...} block) is associated with a condition, is to output the current line.

In cases where you have a large number of files (many thousands), this may generate an "Argument list too long" error. In that case, you may want to loop:
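A sketch of such a loop, under the same assumptions as before: the sample files and the scratch directory are invented for illustration, and the variable name filename is arbitrary.

```shell
# Hypothetical sample data in a scratch directory (not part of the question)
workdir=$(mktemp -d)
cd "$workdir"
printf 'rs1 A G 0.1 0.62\nrs2 C T 0.2 0.31\n' > chr1.imputed.a_info
printf 'rs3 A C 0.3 0.50\n' > chr2.imputed.b_info

# One awk invocation per file. The redirection is attached to the
# loop as a whole, so snplist.txt is opened once and collects the
# output of every iteration instead of being overwritten each time.
for filename in ./*.imputed.*_info; do
    awk '$5 >= 0.5' "$filename"
done > snplist.txt

cat snplist.txt
```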
Note that the result of awk does not need to be stored in a variable. Here, it's just output, and the loop (and therefore all commands inside the loop) is redirected into snplist.txt.

For many thousands of files, this would be quite slow since awk would need to be invoked for each of them individually.

To speed things up, in the cases where you have too many files for a single invocation of awk, you may consider using xargs, with a pipeline that creates a list of filenames with printf and passes them off to xargs as a nul-terminated list. The xargs utility would take these and start awk with as many of them as possible at once, in batches. The output of the whole pipeline would be redirected to snplist.txt.

This xargs alternative assumes that you are using a Unix, like Linux, which has an xargs command that implements the non-standard -0 option to read nul-terminated input. It also assumes that you are using a shell, like bash, that has a built-in printf utility, since an external printf would run into the same argument list limit (ksh, the default shell on OpenBSD, would not work here as it has no such built-in utility).

For the zsh shell (i.e. not bash), you can use zargs, which is basically a reimplementation of xargs as a loadable zsh shell function. See zargs --help (after loading the function) and the zshcontrib(1) manual for further information about that.
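For reference, the xargs pipeline described above might be written as follows. This is a sketch that assumes an xargs with the -0 option and a shell with a built-in printf; the sample files and the scratch directory are again made up for illustration.

```shell
# Hypothetical sample data in a scratch directory (not part of the question)
workdir=$(mktemp -d)
cd "$workdir"
printf 'rs1 A G 0.1 0.62\nrs2 C T 0.2 0.31\n' > chr1.imputed.a_info
printf 'rs3 A C 0.3 0.50\n' > chr2.imputed.b_info

# printf emits each filename followed by a nul byte; xargs -0 reads
# that list and runs awk on as many files per batch as will fit.
printf '%s\0' ./*.imputed.*_info | xargs -0 awk '$5 >= 0.5' > snplist.txt

cat snplist.txt
```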
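The zargs variant could look something like the sketch below. The two lines inside the here-document are what would be typed at a zsh prompt; the surrounding wrapper only exists so that the example can also be started from another shell, and it does nothing if zsh is not installed. The sample files are hypothetical, and the sketch assumes a zsh installation whose default function path provides zargs.

```shell
# Hypothetical sample data in a scratch directory (not part of the question)
workdir=$(mktemp -d)
cd "$workdir"
printf 'rs1 A G 0.1 0.62\nrs2 C T 0.2 0.31\n' > chr1.imputed.a_info
printf 'rs3 A C 0.3 0.50\n' > chr2.imputed.b_info

# Run the zsh commands only if zsh is available on this system.
if command -v zsh >/dev/null 2>&1; then
    zsh <<'EOF'
# Load the zargs function, then let it batch the files for awk.
autoload -U zargs
zargs -- ./*.imputed.*_info -- awk '$5 >= 0.5' > snplist.txt
EOF
    cat snplist.txt
fi
```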