Extract information from both the filename and the contents of the file

filenames, scripting, text-processing

I have generated a set of files, each containing a single number. The filenames carry additional information about each file. I want to collect the contents of all the files as one column in a new file, with selected parts of each filename as separate columns.

The filenames look like this: traj-num1-iter-num2-states-num3.gradient, where num1, num2, and num3 are just different numbers. An example of what I want:

$ cat traj-10-iter-220-states-01.gradient
-0.0014868599999999788

$ cat newfile
traj    iter     states    gradient
10      220      01        -0.0014868599999999788

I suspect this can be achieved, but I don't know how.

Best Answer

Using AWK’s FILENAME variable:

awk 'BEGIN { OFS = "\t"; print "traj", "iter", "states", "gradient"; FS="-|\\." } { gradient=$0; $0=FILENAME; print $2, $4, $6, gradient }' traj-*-iter-*-states-*.gradient

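will output the requested header line, then process each traj-*-iter-*-states-*.gradient file, outputting the values extracted from its filename followed by its contents. Assigning FILENAME to $0 makes AWK re-split the filename on "-" and ".", so fields 2, 4, and 6 hold the three numbers.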
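To make the steps easier to follow, here is the same command spread over several lines with comments, run against two sample files. This is only a sketch: the scratch directory, the second file, and its value 0.25 are made up for the demonstration, and the output is redirected to newfile as in the question.

```shell
# Hypothetical demo data in a scratch directory (not from the question,
# apart from the first file name and its value).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '-0.0014868599999999788' > traj-10-iter-220-states-01.gradient
printf '%s\n' '0.25' > traj-11-iter-221-states-02.gradient

# Same logic as the one-liner above, with one statement per line.
awk '
BEGIN {
    OFS = "\t"                                   # tab-separated output
    print "traj", "iter", "states", "gradient"   # fixed header line
    FS = "-|\\."                                 # split fields on "-" or "."
}
{
    gradient = $0      # save the line read from the current file
    $0 = FILENAME      # re-split the file name into fields
    print $2, $4, $6, gradient
}' traj-*-iter-*-states-*.gradient > newfile

cat newfile
```

The glob is expanded by the shell in sorted order, so the rows appear in filename order. Note that newfile does not match the glob, so writing it into the same directory is safe.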

The following variant, based on a suggestion by Olivier Dulac, extracts the header line from the filename and uses a simpler version of FS:

awk 'BEGIN { OFS = "\t"; FS="[-.]" } { contents=$0; $0=FILENAME; if (!header) { print $1, $3, $5, $7; header=1 }; print $2, $4, $6, contents }' traj-*-iter-*-states-*.gradient

You can change the glob at the end to match whichever files you’re interested in; the header is taken from the name of the first file that’s processed, so it adapts to the new names.
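One caveat when changing the glob: both commands assume FILENAME is a bare file name. If the glob includes a directory part, any "-" or "." in the path shifts the field numbers. A possible workaround, sketched below, strips the directory first and uses split() instead of reassigning $0. It also restricts processing to the first line of each file with FNR == 1, which is an added assumption about what should happen if a file ever has more than one line. The run-1.0 directory and the name and parts variables are hypothetical.

```shell
# Hypothetical layout: the data file lives in a directory whose name
# itself contains "-" and ".".
workdir=$(mktemp -d)
mkdir -p "$workdir/run-1.0"
printf '%s\n' '-0.0014868599999999788' \
    > "$workdir/run-1.0/traj-10-iter-220-states-01.gradient"

awk '
BEGIN { OFS = "\t" }
FNR == 1 {                           # only the first line of each file
    name = FILENAME
    sub(/.*\//, "", name)            # drop any leading directory part
    split(name, parts, /[-.]/)       # parts[1]="traj", parts[2]="10", ...
    if (!header) {                   # print the header once, from the first name
        print parts[1], parts[3], parts[5], parts[7]
        header = 1
    }
    print parts[2], parts[4], parts[6], $0
}' "$workdir"/run-1.0/traj-*-iter-*-states-*.gradient > "$workdir/newfile"

cat "$workdir/newfile"
```

Because the file name is split into a separate array, $0 is left untouched and can be printed directly as the gradient value.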
