Ubuntu – Search output of AWK in another file

awk, command line, text processing

I have two files, fileA and fileB.

I need to extract column 1 from fileA (as with awk '{print $1}'), search for each extracted value in fileB, and save the matching records to a new file, fileC. For example:

fileA:

seg1     rec1
seg2     rec2
seg3     rec3

Column 1 is retrieved with awk and then looked up in fileB:

fileB:

seg1     one
seg2     two
seg3     three
seg4     four
seg5     five

The column 1 data extracted from fileA is used to search fileB, and the matching records are saved to a new file.
My output should look like this:

fileC:

seg1       one
seg2       two
seg3       three

Best Answer

This can be done easily with awk:

awk 'NR==FNR{inFileA[$1]; next} ($1 in inFileA)' fileA fileB > write_to_fileC

Result:

seg1       one
seg2       two
seg3       three

Here, awk first reads fileA and stores every value of column 1 as a key of the array inFileA (NR==FNR is true only while the first file is being read). It then reads fileB and, whenever the first column of a line is a key of inFileA, prints the entire line of fileB.
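To try the command without touching real data, the following sketch rebuilds the question's sample files in a scratch directory and runs the same two-file idiom on them. The directory created by mktemp and the array name seen are illustrative choices, not part of the answer.

```shell
#!/bin/sh
# Sketch: reproduce the fileA/fileB example in a scratch directory.
workdir=$(mktemp -d)

printf 'seg1     rec1\nseg2     rec2\nseg3     rec3\n' > "$workdir/fileA"
printf 'seg1     one\nseg2     two\nseg3     three\nseg4     four\nseg5     five\n' > "$workdir/fileB"

# NR==FNR holds only while the first file (fileA) is being read:
# remember its column 1 as array keys, then skip to the next line.
# For fileB, print the line when its column 1 is one of those keys.
awk 'NR==FNR { seen[$1]; next } ($1 in seen)' "$workdir/fileA" "$workdir/fileB" > "$workdir/fileC"

result=$(cat "$workdir/fileC")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Because the lookup is an exact match on column 1, a key such as seg1 does not accidentally match seg10, which a plain substring search could do.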

Related Solutions

Ubuntu – Printing pattern of “C” character

#1

Golfed one-liner:

perl -lane 'my $s;my @m=$F[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@F,$s);print(join(",",@F))' infile

Expanded full script:

#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

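# Three-argument open: the file name is never interpreted as a mode
open(my $in, "<", $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");

while(<$in>) {
    my $string = "";
    # Split on whitespace into the two columns
    my @fields = split(" ");
    # Every "C" together with the character following it, if any
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        # Parenthesized: ?: binds more tightly than .=
        $string .= ($_ eq "CC" ? "CC#" : "C#");
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n");
}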

close($in);

exit

Explanation:

The input file is processed line by line;
Each line is split into two strings, the part before the space and the part after it;
Every substring of the second string made of a "C" optionally followed by one more character is collected (the following character is optional so that a "C" at the very end of the string is caught too); for each such substring, "CC#" is appended to a temporary string if the substring is "CC", and "C#" is appended otherwise. In the one-liner this relies on ?: binding more tightly than .=: the "CC" branch appends "C", and the trailing .= "C#" then applies to whichever branch was taken;
The first string, the second string and the temporary string are printed, comma-separated, followed by a newline.
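The per-line steps can also be sketched in awk. This is an alternative illustration, not the answer's method; the two input lines are shortened, made-up sequences, and the "CC#" marker for a "CC" pair follows what the sample output shows.

```shell
#!/bin/sh
# Sketch: awk port of the per-line steps; sample data shortened for illustration.
input='id1 ACSCCTC
id2 GGCAKC'

result=$(printf '%s\n' "$input" | awk '{
    rest = $2; pattern = ""
    # Walk the sequence one "C plus optional next character" match at a time
    while (match(rest, /C.?/)) {
        hit = substr(rest, RSTART, RLENGTH)
        pattern = pattern (hit == "CC" ? "CC#" : "C#")
        rest = substr(rest, RSTART + RLENGTH)
    }
    print $1 "," $2 "," pattern
}')

printf '%s\n' "$result"
```

awk has no global-match operator, so the loop calls match() repeatedly and cuts the consumed prefix off with substr() after each hit.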

Sample output:

% cat infile
c32_g1_i1_3 GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
c32_g1_i1_6 AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl -ne 'my $s;my @f=split(" ");my @m=$f[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@f,$s);print(join(",",@f)."\n")' infile
c32_g1_i1_3,GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#
c32_g1_i1_6,AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#

#2

Expanded full version:

#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

open(my $in, "<", $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
open(my $tmp, "+>", "tmpfile") || die("Could not create temporary file \"tmpfile\": $!\n");

select($tmp);

while(<$in>) {
    if(/^>/) {
        s/$/ /
    }
    if(my $next = <$in>) {
        if($next !~ /^>/) {
            chomp
        }
        print;
        seek($in, -length($next), 1)
    }
    else {
        print
    }
}

close($in);

seek($tmp, 0, 0);

select(STDOUT);

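while(<$tmp>) {
    my $string = "";
    # Split at the last space (header / sequence) and drop the trailing newline
    my @fields = split(/ (?!.* )|\n/);
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        # Parenthesized: ?: binds more tightly than .=
        $string .= ($_ eq "CC" ? "CC#" : "C#");
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n");
}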

close($tmp);

unlink("tmpfile");

exit

Explanation:

The input file is processed line by line;
If the current line starts with a > character, a space is appended to the line; if a following line exists and doesn't start with a > character, the newline character is stripped from the current line; the current line is printed to a temporary file;
The temporary file is processed line by line;
Each line is split into two strings, the part before the last space and the part after the last space;
Every substring of the second string made of a "C" optionally followed by one more character is collected (the following character is optional so that a "C" at the very end of the string is caught too); for each such substring, "CC#" is appended to a temporary string if the substring is "CC", and "C#" is appended otherwise;
The first string, the second string and the temporary string are printed, comma-separated, followed by a newline;
The temporary file is removed.
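The line-joining step can be sketched without a temporary file by accumulating the sequence in a variable until the next > header arrives. This awk version is an alternative illustration under that assumption, not the answer's script; the headers and sequences below are invented, shortened samples.

```shell
#!/bin/sh
# Sketch: join wrapped sequence lines under each ">" header, no temporary file.
input='>idA
ACSC
CTC
>idB
GGCAKC'

result=$(printf '%s\n' "$input" | awk '
    # Header line: flush the previous record, then start a new one
    /^>/ { if (id != "") print id "," seq; id = $0; seq = ""; next }
    # Any other line continues the current sequence
    { seq = seq $0 }
    # Flush the last record at end of input
    END { if (id != "") print id "," seq }
')

printf '%s\n' "$result"
```

The output has one header,sequence pair per line, which is the same shape the second loop of the script works on.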

Sample output:

% cat infile 
>c32_g1_i1_
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
>c32_g1_i1_
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl script.pl infile 
>c32_g1_i1_,3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#C#C#C#C#C#
>c32_g1_i1_,6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#C#C#CC#C#

Ubuntu – Run shell script to take output of a file and convert it into Excel format

Using Perl + ssconvert (in the gnumeric package):

perl -F'\012' -00ane 'BEGIN {$, = ","; $\ = "\n"; print("ID,Name,Age,Education")} my @f; foreach(@F) {s/.*?: +//; push(@f, $_)} print(@f)' test1.txt test2.txt | ssconvert fd://0 output.xls

The Perl command reads test1.txt and test2.txt using blank lines as record separators (-00, paragraph mode) and newline characters as field separators (-F'\012'). It prints the header (ID,Name,Age,Education); then, for each field of each record, it strips everything up to and including the first : character and the spaces that follow it, and prints the record using commas as field separators and a newline character as the record separator (i.e., it converts test1.txt and test2.txt to CSV):

% cat test1.txt
ID : 1
Name: xxxx
Age: 33
Education: Mtech

ID: 2
Name: yyyy
Age: 22
Education: bsc
% cat test2.txt
ID : 3
Name: xxxx
Age: 33
Education: Mtech

ID: 4
Name: yyyy
Age: 22
Education: bsc
% perl -F'\012' -00ane 'BEGIN {$, = ","; $\ = "\n"; print("ID,Name,Age,Education")} my @f; foreach(@F) {s/.*?: +//; push(@f, $_)} print(@f)' test1.txt test2.txt
ID,Name,Age,Education
1,xxxx,33,Mtech
2,yyyy,22,bsc
3,xxxx,33,Mtech
4,yyyy,22,bsc

The ssconvert command reads from STDIN and converts the file to an Excel spreadsheet.

If installing gnumeric to obtain ssconvert is not an option, you can run just the Perl command and import the resulting CSV into Excel or another spreadsheet application:

perl -F'\012' -00ane 'BEGIN {$, = ","; $\ = "\n"; print("ID,Name,Age,Education")} my @f; foreach(@F) {s/.*?: +//; push(@f, $_)} print(@f)' test1.txt test2.txt >output.csv
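The same conversion can be sketched with awk alone, in case an awk variant is easier to adapt. This is an illustration rather than the answer's command; it assumes every record lists its four fields in the order ID, Name, Age, Education, as in the sample files, and it uses a scratch copy of test1.txt.

```shell
#!/bin/sh
# Sketch: awk-only conversion of "Key: value" records to CSV.
workdir=$(mktemp -d)
printf 'ID : 1\nName: xxxx\nAge: 33\nEducation: Mtech\n\nID: 2\nName: yyyy\nAge: 22\nEducation: bsc\n' > "$workdir/test1.txt"

result=$(awk '
    # RS="" selects paragraph mode: blank lines separate records
    BEGIN { RS = ""; FS = "\n"; OFS = ","; print "ID,Name,Age,Education" }
    {
        # Drop the label, the colon and the spaces after it from every field
        for (i = 1; i <= NF; i++) sub(/^[^:]*: +/, "", $i)
        print
    }
' "$workdir/test1.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Assigning to a field makes awk rebuild the record with OFS, which is what turns the newline-separated fields into one comma-separated line.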
