Bash – Match lines of one file against headers in another to extract the whole paragraph

awk bash perl

I need help with a script that works on two files. File 1 lists amino acids in a specific order, one per line, and names may repeat. File 2 lists the characteristic features of each amino acid under a header with its name. For every amino acid in file 1, I want to find the matching header in file 2 and copy that header and its features to an output file, in the same order as in file 1.

For example
File1.txt

    Threonine
    Glutamine
    Alanine
    Asparatate
    Glutamine
    Alanine
    Threonine

File2.txt

    [ Alanine ] 
    89.1    13.7    -3.12   -10.09
    [ Asparatate ]  
    133.1   30  -2.43   -10.35
    [ Glutamine ]   
    146.1   42.7    -3.46   -10.23
    [ Threonine ]   
    119.1   28.5    -2.43   -9.99

The output I am expecting is as below:
output.txt

    [ Threonine ]   
    119.1   28.5    -2.43   -9.99
    [ Glutamine ]   
    146.1   42.7    -3.46   -10.23
    [ Alanine ] 
    89.1    13.7    -3.12   -10.09
    [ Asparatate ]  
    133.1   30  -2.43   -10.35
    [ Glutamine ]   
    146.1   42.7    -3.46   -10.23
    [ Alanine ] 
    89.1    13.7    -3.12   -10.09 
    [ Threonine ]   
    119.1   28.5    -2.43   -9.99

I have tried the awk script below. It works when the headers are numbers rather than words, but it does not work for this purpose.

awk 'FNR==NR { a[ "\\[ " $1 " \\]" ]; next } /^\[/ { f=0 } { for (i in a) if ($0 ~ i) f=1 } f' file1.txt file2.txt > output.txt

I do not know how to modify the script so that it also works on words. Please tell me where I am going wrong and help me get the desired output.

I would greatly appreciate your help.

Thanks in advance.

Asha

Best Answer

All you need to do is loop through the acids in File1.txt and, for each one, find the matching line in File2.txt plus the one line after it, which is easily done with grep:

for acid in $(sed 's/^\s*//' File1.txt)
do
    grep -FA1 "$acid" File2.txt
done > Output.txt
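
A variant of the same loop, as a minimal sketch: it reads File1.txt line by line and searches for the full header `[ Name ]` instead of the bare name, so a name that is a substring of another header would not match. The sample data and the temporary directory are assumptions for illustration. `-A` is not part of POSIX grep, but GNU and BSD grep provide it.

```shell
# Sample data (assumed, shortened from the question).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' Threonine Glutamine Alanine > File1.txt
printf '%s\n' \
    '[ Alanine ]'   '89.1 13.7 -3.12 -10.09' \
    '[ Glutamine ]' '146.1 42.7 -3.46 -10.23' \
    '[ Threonine ]' '119.1 28.5 -2.43 -9.99' > File2.txt

# read strips leading and trailing blanks, so no sed is needed.
while read -r acid; do
    [ -n "$acid" ] || continue            # skip empty lines
    grep -F -A1 -- "[ $acid ]" File2.txt  # header line plus the line after it
done < File1.txt > Output.txt

cat Output.txt
```

Because grep runs once per line of File1.txt, this is fine for short lists but rereads File2.txt every time.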

But if you like awk:

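awk '
FNR!=NR{                    # second file (File1.txt): one acid name per line
    print "    [",$1,"]"
    print acids[$1]         # features saved from File2.txt
    next
}
/\[/{                       # File2.txt header line, e.g. "[ Alanine ]"
    acid=$2
    next
}
{
    acids[acid]=$0          # File2.txt data line: store it under the current header
}' File2.txt File1.txt > Output.txt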
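
The awk version above keeps only the last line seen under each header. If a header in File2.txt can be followed by several lines, a minimal sketch along the same lines appends every line to the stored block instead. It assumes that headers always look like `[ Name ]` and that File1.txt holds one name per line; the sample data and the temporary directory are only for illustration.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Assumed sample data: Alanine has two lines under its header.
printf '%s\n' Threonine Alanine Threonine > File1.txt
printf '%s\n' '[ Alanine ]' '89.1 13.7' 'extra line' \
              '[ Threonine ]' '119.1 28.5' > File2.txt

awk '
FNR == NR {                                   # first file: File2.txt
    if ($1 == "[") {                          # header starts a new block
        name = $2
        block[name] = $0
    } else if (name != "") {                  # body line: append to the block
        block[name] = block[name] "\n" $0
    }
    next
}
($1 in block) { print block[$1] }             # second file: File1.txt
' File2.txt File1.txt > Output.txt

cat Output.txt
```

Names in File1.txt that have no header in File2.txt are skipped silently here; whether that is the right behaviour depends on the data.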

Related Solutions

Merging 2 files based on field match

$ awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' file2 file1
aa 45 32
bb 31 15
cc 50 78

Explanation:

awk implicitly loops through each file, one line at a time. Since we gave it file2 as the first argument, it is read first. file1 is read second.

FNR==NR{a[$1]=$2;next}

NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file2. For every line in file2, we assign a[$1]=$2.

Here, a is an associative array and a[$1]=$2 means saving file2's second column, denoted $2, as a value in array a using file2's first column, $1, as the key.

next tells awk to skip the rest of the commands and start over with the next line.

($1 in a) {print $1,a[$1],$2}

If we get here, that means that we are reading the second file: file1. If the first field of the current line was also seen in file2, as determined by the keys of array a, then we print the first field followed by the values of field 2 from both files.
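
To see the idiom end to end, here is a small sketch. The contents of file1 and file2 are assumptions, chosen so that the command reproduces the three output lines shown earlier; the original inputs are not given. The extra `dd` line is there only to show that keys missing from file2 are dropped.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Assumed inputs (not from the original post).
printf '%s\n' 'aa 32' 'bb 15' 'cc 78' 'dd 99' > file1
printf '%s\n' 'aa 45' 'bb 31' 'cc 50' > file2

# First pass (file2) fills the array; second pass (file1) prints matches.
merged=$(awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' file2 file1)
printf '%s\n' "$merged"
```

One caveat of this idiom: if the first file is empty, FNR==NR is also true while reading the second file, so nothing is printed.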

Compare two files and print matches – large files

If the files are sorted (the samples you posted are) then it's as simple as

join -t : File1.txt File2.txt

join pairs up lines from two files where the join field is equal. By default, the join field is the first field, the fields are output in order except that the join field is not repeated, and non-pairable lines are skipped, which is exactly what you want.

Note that if the files have Windows line endings, they appear under Unix systems to have an extra carriage return (CR) character at the end of each line. The CR is usually invisible on screen, but as far as join and other text tools are concerned, it's a character like any other. It means the fields of File1.txt all end with a CR whereas the ones in File2.txt don't, so they don't match. You need to strip the CR, at least in File1.txt.

<File1.txt tr -d '\r' | join -t : - File2.txt
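
A small sketch of the effect, using assumed sample files in which File1.txt has one key per line and Windows line endings. `LC_ALL=C` is set so that comparisons are byte-wise, which is an addition of this sketch and not part of the commands above.

```shell
export LC_ALL=C                       # byte-wise comparison
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'a\r\nb\r\n' > File1.txt       # assumed sample with CR LF endings
printf 'a:x\nb:y\nc:z\n' > File2.txt  # assumed sample with LF endings

# Without stripping, the keys are "a<CR>" and "b<CR>", so nothing pairs up.
raw=$(join -t : File1.txt File2.txt)

# With the CR removed, the keys match.
clean=$(tr -d '\r' < File1.txt | join -t : - File2.txt)

printf 'raw:   [%s]\n' "$raw"
printf 'clean: [%s]\n' "$clean"
```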

You do need to sort the files. If they aren't sorted, then in ksh/bash/zsh you can use process substitution. (Add tr -d '\r' | if needed.)

join -t : <(sort File1.txt) <(sort File2.txt)

In plain sh, if your Unix variant has /dev/fd (most do), you can use that instead to pipe the output of two programs through two file descriptors.

sort File2.txt | { sort File1.txt | join -t : /dev/fd/0 /dev/fd/3; } 3<&0

If you need to preserve the original order of File1.txt and it isn't sorted by the join field, then add line numbers to remember the original order, sort by the join field, join, sort by line numbers and strip the line numbers. (You can do something similar if you want to preserve the order of the other file.)

<File1.txt nl -s : |
sort -t : -k 2,2 |
join -t : -1 2 - <(sort File2.txt) |
sort -t : -k 2,2n |
cut -d : -f 1,3-
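
The numbering trick can be tried on a small example. This is a sketch with assumed sample files; it writes the sorted copy of File2.txt to a temporary file (the name File2.sorted is hypothetical) instead of using process substitution, so that it also runs in plain sh. It restricts the first sort key to the join field (`-k 2,2`) and keeps every remaining field (`-f 1,3-`).

```shell
export LC_ALL=C                      # make sort and join agree on ordering
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Assumed sample data: File1.txt is unsorted and repeats a key.
printf '%s\n' c a c b > File1.txt
printf '%s\n' 'a:apple' 'b:banana' 'c:cherry' > File2.txt

sort File2.txt > File2.sorted        # hypothetical temporary file name

nl -s : File1.txt |                          # "     1:c", "     2:a", ...
sort -t : -k 2,2 |                           # sort by the key only
join -t : -1 2 - File2.sorted |              # gives key:number:value
sort -t : -k 2,2n |                          # back to the original order
cut -d : -f 1,3- > Output.txt                # drop the line number

cat Output.txt
```

Note that nl does not number empty lines by default, so this sketch assumes File1.txt has none.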
