Ubuntu – For loop with file names

bash, command-line, files

I have several files (tables) named as: institute_model_scenario_river.txt

(institute, model, scenario, and river are variables.) I would like to create a for loop that identifies all files that share both the same institute name and the same scenario name, so that the results of the different models can be combined into one output file with the following command:

paste filename1.txt filename2.txt > output_file.txt

I know how to create a for loop over different folders, but not over file names. Does anyone have an idea?

As a minimal example, the filenames could be as follows:

wbm_gfdl_rcp8p5_mississippi.txt
wbm_hadgem_rcp8p5_mississippi.txt
matsiro_gfdl_rcp8p5_mississippi.txt
matsiro_ipsl_rcp4p5_mississippi.txt
matsiro_hadgem_rcp4p5_mississippi.txt
matsiro_miroc_rcp8p5_mississippi.txt

Then, I would like to append the following files together:

wbm_gfdl_rcp8p5_mississippi.txt with
wbm_hadgem_rcp8p5_mississippi.txt

matsiro_ipsl_rcp4p5_mississippi.txt with
matsiro_hadgem_rcp4p5_mississippi.txt

matsiro_gfdl_rcp8p5_mississippi.txt with
matsiro_miroc_rcp8p5_mississippi.txt

Best Answer

If the files are all in the same directory, you can run:

ls |
awk -F_ '{ i=$1; m=$2; s=$3; f[i"_"s] = f[i"_"s] " " $0 }
         END{ for(insc in f)
                printf "paste%s >out_%s.txt\n",f[insc],insc
         }'

This splits each filename on "_" (-F_), sets the variables i, m, and s to the first three parts of the filename (institute, model, scenario), and accumulates the filenames in the array f. The array is indexed only by institute and scenario, so all models for a given pair end up in the same entry (m isn't used). The END block then prints one paste command per entry of f, using the index (institute_scenario) as part of the output filename. With your examples this produces the following (awk does not guarantee the order of the lines):

paste wbm_gfdl_rcp8p5_mississippi.txt wbm_hadgem_rcp8p5_mississippi.txt >out_wbm_rcp8p5.txt
paste matsiro_hadgem_rcp4p5_mississippi.txt matsiro_ipsl_rcp4p5_mississippi.txt >out_matsiro_rcp4p5.txt
paste matsiro_gfdl_rcp8p5_mississippi.txt matsiro_miroc_rcp8p5_mississippi.txt >out_matsiro_rcp8p5.txt

These commands are only printed, not run. To execute them, pipe the output into a shell by adding | sh after the closing quote on the last line of the awk command.
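Putting the pieces together, the complete pipeline might look like the sketch below. The scratch directory and the one-line sample files are assumptions made only so the example can run on its own; in practice you would run the ls | awk ... | sh part in the directory that holds your tables.

```bash
#!/usr/bin/env bash
# Scratch directory with hypothetical one-line sample tables (not real data).
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
for stem in wbm_gfdl_rcp8p5 wbm_hadgem_rcp8p5 matsiro_gfdl_rcp8p5 \
            matsiro_ipsl_rcp4p5 matsiro_hadgem_rcp4p5 matsiro_miroc_rcp8p5; do
    printf '%s\n' "$stem" > "${stem}_mississippi.txt"
done

# Generate one paste command per institute_scenario pair and run them with sh.
ls |
awk -F_ '{ i=$1; m=$2; s=$3; f[i"_"s] = f[i"_"s] " " $0 }
         END{ for(insc in f)
                printf "paste%s >out_%s.txt\n",f[insc],insc
         }' |
sh

# Show one of the combined tables.
cat out_matsiro_rcp8p5.txt
```

Running the pipeline once without the final | sh is a cheap way to check the generated commands before anything is written.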


To remove some columns from the input files, you need to alter the awk line that collects the input filenames. In the first line of the awk program:

{ i=$1; m=$2; s=$3; f[i"_"s] = f[i"_"s] " " $0 }

the filename is $0. For example, if you change this line to:

{ i=$1; m=$2; s=$3; f[i"_"s] = f[i"_"s] sprintf(" <(cut -f4 %s)",$0) }

then you will get the example output:

paste <(cut -f4 wbm_gfdl_rcp8p5_mississippi.txt) <(cut -f4 wbm_hadgem_rcp8p5_mississippi.txt) >out_wbm_rcp8p5.txt
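To see what such a generated command does, here is a small sketch you can run by hand. The two sample files and their four tab-separated columns are made up for illustration. The <(...) construct is bash process substitution, so this needs bash rather than plain sh.

```bash
#!/usr/bin/env bash
# Hypothetical sample tables with four tab-separated columns each.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '1\t2\t3\tA\n4\t5\t6\tB\n' > wbm_gfdl_rcp8p5_mississippi.txt
printf '1\t2\t3\tC\n4\t5\t6\tD\n' > wbm_hadgem_rcp8p5_mississippi.txt

# Each <(cut -f4 FILE) behaves like a file that holds only column 4 of FILE.
paste <(cut -f4 wbm_gfdl_rcp8p5_mississippi.txt) \
      <(cut -f4 wbm_hadgem_rcp8p5_mississippi.txt) > out_wbm_rcp8p5.txt

# The result has two columns: column 4 of each input file.
cat out_wbm_rcp8p5.txt
```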

If you want to cut only the second file, it is a bit more complicated, and you need this instead:

{ i=$1; m=$2; s=$3; 
  if(f[i"_"s]=="")add = $0; else add = sprintf("<(cut -f4 %s)",$0);
  f[i"_"s] = f[i"_"s] " " add }

so you will get

paste wbm_gfdl_rcp8p5_mississippi.txt <(cut -f4 wbm_hadgem_rcp8p5_mississippi.txt) >out_wbm_rcp8p5.txt

If sh does not understand the <(cut ...) syntax, pipe into bash instead of sh. On Ubuntu, sh is normally dash, which does not support this syntax.
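Since the question asked for a for loop, the same grouping can also be sketched in bash alone, without generating commands. This is only one possible approach, not part of the answer above: it assumes bash 4 or later (for associative arrays) and filenames of the form institute_model_scenario_river.txt that contain no spaces. The scratch directory, the sample files, and the variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Scratch directory with hypothetical one-line sample tables (not real data).
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1
for stem in wbm_gfdl_rcp8p5 wbm_hadgem_rcp8p5 \
            matsiro_ipsl_rcp4p5 matsiro_hadgem_rcp4p5; do
    printf '%s\n' "$stem" > "${stem}_mississippi.txt"
done

# Key: institute_scenario. Value: space-separated list of matching files.
declare -A groups
for file in *_*_*_*.txt; do
    [ -e "$file" ] || continue          # skip the literal pattern if nothing matches
    # Split the name on "_"; model and river are read but not used for grouping.
    IFS=_ read -r institute model scenario river <<< "$file"
    groups["${institute}_${scenario}"]+="$file "
done

for key in "${!groups[@]}"; do
    # Left unquoted on purpose so the list splits back into separate filenames.
    paste ${groups[$key]} > "out_${key}.txt"
done
```

The glob expands in sorted order, so the files in each group are pasted alphabetically, just as with ls in the awk version.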