Need to extract 2 strings from specific lines of multiple files and print to new file, separated by tab

text processing

I have some Unix and Python experience, but am a bit rusty and don't know enough to figure this out myself.

I have a folder containing multiple files, from each of which I need to extract 2 specific strings and print them to a new text file with a tab between them. The files look like this:

mlu +t*CHI +f
Thu Jan 12 16:27:40 2017
mlu (08-Jan-2016) is conducting analyses on:  
  ONLY dependent tiers matching: %MOR;
****************************************
From file <adam01.cha>
MLU for Speaker: *CHI:
  MLU (xxx, yyy and www are EXCLUDED from the utterance and morpheme counts):
    Number of: utterances = 1236, morphemes = 2735
    Ratio of morphemes over utterances = 2.213
    Standard deviation = 1.300

I need to extract the file name and the value for "Ratio of morphemes over utterances". So I want to pull those strings out to create a new file that looks like this:

adam01.cha    2.213
adam02.cha    2.547
...

I can't figure out what exactly I need to do here. I'm not even sure where to start – I can't determine if egrep, awk, or sed is my best option, and how to put these into a for loop that would be able to iterate through the files properly.

Best Answer

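You can use sed within a loop over each file in your current folder. sed saves the "From file" line, joins it with the "Ratio" line, and prints the file name and the value separated by a tab; the loop's output is appended via >> to a file named file. The \t in the replacement is understood by GNU sed; other sed implementations may need a literal tab character there instead. Note that * also matches the output file itself if it sits in the same folder, so you may want to write it elsewhere or use a narrower glob.

for files in *; \
do sed -n -e '/^From file/ h;' \
          -e '/Ratio of morphemes over utterances/ {H; x; s/\n//g; s/From file <\(.*\)>.*Ratio of morphemes over utterances = \([0-9]*\.[0-9]*\).*/\1\t\2/; p;}' "$files";
done >>file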
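
If the sed expression is hard to read, the same extraction can be sketched with awk, which accepts all the files at once and so needs no explicit loop. This is only a sketch under some assumptions: the function name extract_mlu, the .txt extension and the output name results.tsv are made up for illustration, and each report is assumed to have its "From file <...>" line before the matching "Ratio" line, as in the sample in the question.

```shell
# Hypothetical helper: print "<file name><TAB><ratio>" for every MLU report
# passed as an argument.
extract_mlu() {
    awk '
        # Remember the name between < and > on each "From file <...>" line.
        /^From file </ {
            name = $0
            sub(/^From file </, "", name)
            sub(/>.*$/, "", name)
        }
        # On the ratio line the value is the last whitespace-separated field.
        /Ratio of morphemes over utterances/ {
            printf "%s\t%s\n", name, $NF
        }
    ' "$@"
}

# Example usage (assumed names: reports end in .txt, output goes to results.tsv):
# extract_mlu *.txt > results.tsv
```

Because the name is overwritten on every "From file" line, this also copes with a single file that holds several reports one after another.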