Extract information from both the filename and the contents of the file

filenames, scripting, text processing

I have generated a number of files, each containing a single number. Each filename also carries some information about the file. I want to collect the contents of all the files as one column in a new file, and extract parts of each filename as separate columns in that same file.

The filenames look like this: traj-num1-iter-num2-states-num3.gradient, where num1, num2, and num3 are just different numbers. An example of what I want:

$ cat traj-10-iter-220-states-01.gradient
-0.0014868599999999788

$ cat newfile
traj    iter     states    gradient
10      220      01        -0.0014868599999999788

I suspect this can be achieved, but I don't know how.

Best Answer

Using AWK’s FILENAME variable:

awk 'BEGIN { OFS = "\t"; print "traj", "iter", "states", "gradient"; FS="-|\\." } { gradient=$0; $0=FILENAME; print $2, $4, $6, gradient }' traj-*-iter-*-states-*.gradient

will output the requested header line, then process each traj-*-iter-*-states-*.gradient file, outputting the values extracted from its filename, and its contents.
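To see why fields 2, 4 and 6 hold the numbers, it can help to reproduce the split outside AWK. The following is only a minimal Python sketch, assuming the example filename from the question; `split_fields` is a hypothetical helper introduced for illustration, not part of the answer.

```python
import re


def split_fields(filename):
    """Split a filename on '-' or '.', the same separators the AWK FS matches."""
    return re.split(r"[-.]", filename)


fields = split_fields("traj-10-iter-220-states-01.gradient")
# AWK fields are 1-based, so $2, $4 and $6 correspond to
# fields[1], fields[3] and fields[5] here.
print(fields[1], fields[3], fields[5])  # prints: 10 220 01
```

The odd-numbered fields (traj, iter, states, gradient) are the labels, which is what the header line uses.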

The following variant, based on a suggestion by Olivier Dulac, extracts the header line from the filename and uses a simpler version of FS:

awk 'BEGIN { OFS = "\t"; FS="[-.]" } { contents=$0; $0=FILENAME; if (!header) { print $1, $3, $5, $7; header=1 }; print $2, $4, $6, contents }' traj-*-iter-*-states-*.gradient

You can change the glob at the end to match whichever files you’re interested in, and the header will adapt (to the first file that’s processed).
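For comparison, the same idea can be sketched in Python. This is an illustration under stated assumptions rather than a replacement for the AWK command: `build_rows` is a hypothetical function name, each file is assumed to contain a single value, and the filename is assumed to follow the label-number pattern from the question. Unlike the AWK version, it splits only the base name, so directory components in the path do not shift the fields.

```python
import glob
import os
import re


def build_rows(paths):
    """Return a header row followed by one row per file."""
    rows = []
    for path in paths:
        # e.g. ['traj', '10', 'iter', '220', 'states', '01', 'gradient']
        name = re.split(r"[-.]", os.path.basename(path))
        if not rows:
            # Header taken from the first file processed, as in the AWK variant
            rows.append([name[0], name[2], name[4], name[6]])
        with open(path) as handle:
            contents = handle.read().strip()  # assumes one value per file
        rows.append([name[1], name[3], name[5], contents])
    return rows


# Print a tab-separated table for all matching files in the current directory
for row in build_rows(sorted(glob.glob("traj-*-iter-*-states-*.gradient"))):
    print("\t".join(row))
```

Sorting the glob result makes the row order deterministic. The shell already expands globs in sorted order for the AWK command.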

Related Solutions

Lum – Select certain column of each file, paste to a new file

With paste and process substitution under bash, you can do:

paste <(cut -f 4 1.txt) <(cut -f 4 2.txt) .... <(cut -f 4 20.txt)

With a Python script and any number of files (python3 scriptname.py column_nr file1 file2 ... filen):

#!/usr/bin/env python3

# Invoke with the number of the column to extract (counting from 1) as
# the first parameter, followed by the filenames. The files should all
# have the same number of rows.

import sys

col = int(sys.argv[1])
res = {}

for file_name in sys.argv[2:]:
    with open(file_name) as handle:
        for line_nr, line in enumerate(handle):
            # Collect the requested tab-separated column, keyed by row number
            res.setdefault(line_nr, []).append(line.strip().split('\t')[col-1])

# Each output row holds the chosen column from every file, tab-separated
for line_nr in sorted(res):
    print('\t'.join(res[line_nr]))
