Number of comma-separated fields in a text file

awk, text processing

I'm trying to build an awk statement to read this file:

A   1,2,3   *
A   4,5,6   **
B   1
B   4,5     *

and build a file like this:

A   1,2,3   *    3   1   0.333
A   4,5,6   **   3   2   0.666
B   1            1   0   0
B   4,5     *    2   1   0.5

In this new file, the first three columns are the same as in the original file. The fourth column must contain the number of comma-separated elements in column 2. The fifth column must contain the number of characters in column 3. The last column contains the ratio of column 5 to column 4 (i.e., column 5 divided by column 4).

I'm trying the following code:

awk '{print $1"\t"$2"\t"$3"\t"(NF","$2 -1)"\t"length($3)"\t"(length($3)/(NF","$2-1))}' file1 > file2

But I got the following output:

A   1,2,3   *    3,0   1   0.333333
A   4,5,6   **   3,3   2   0.666667
B   1            2,0   0   0
B   4,5     *    3,3   1   0.333333

I can't figure out what I'm doing wrong for column 4.

Best Answer

You seem to be hoping that (NF","$2 -1) will be treated as a function that will return the number of comma-delimited elements in field $2 - it won't. NF is always the number of fields in the record.
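As a quick check of what awk actually does with that expression: arithmetic binds tighter than string concatenation, so (NF","$2 -1) is read as NF, then a comma, then $2 - 1, all joined into one string. The sketch below feeds it the first sample line; the function name parse_demo is only for illustration.

```shell
# Hypothetical helper: evaluate the question's expression on one sample line.
parse_demo() {
  # NF is 3; "1,2,3" converts to the number 1, so $2 - 1 is 0.
  # The three pieces are concatenated, giving the string "3,0".
  printf 'A\t1,2,3\t*\n' | awk '{ print (NF "," $2 - 1) }'
}

parse_demo   # prints 3,0
```

This matches the 3,0 seen in the question's output. When that string is later used as a divisor, awk converts it back to the number 3, which is why the last column happened to look right on some lines.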

Instead, you can use awk's split function: split($2,a,",") splits field $2 into the array a and returns the number of elements. You can also tidy up the code by setting the output field separator (OFS) to a tab instead of writing explicit "\t" in the print statement:

awk 'BEGIN{OFS="\t"} {l2=split($2,a,","); print $1, $2, $3, l2, length($3), length($3)/l2}' file1
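For anyone who wants to try this end to end, here is a minimal sketch that recreates the question's sample data and runs the split-based approach on it. The file name sample.txt, the function name count_fields, and the guard against an empty column 2 are assumptions added for illustration; they are not part of the original answer.

```shell
# Recreate the question's sample input (tab-separated; file name is hypothetical).
printf 'A\t1,2,3\t*\nA\t4,5,6\t**\nB\t1\nB\t4,5\t*\n' > sample.txt

# Hypothetical wrapper around the split()-based one-liner.
count_fields() {
  awk 'BEGIN { OFS = "\t" }
       {
         n = split($2, parts, ",")   # split() returns the number of pieces
         stars = length($3)          # $3 is empty when the line has only two fields
         # Added guard: avoid dividing by zero if column 2 is missing.
         print $1, $2, $3, n, stars, (n > 0 ? stars / n : 0)
       }' "$1"
}

count_fields sample.txt
```

The ratio is printed with awk's default number formatting (0.333333 rather than 0.333). If three decimals are wanted, printf with a format such as "%.3f" could replace print, keeping in mind that it rounds 2/3 to 0.667.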

Related Solutions

Text Processing – Merging Columns from Two Separate Files

Try this:

$ awk 'FNR==NR{a[FNR]=$2;next};{$NF=a[FNR]};1' file2 file1
A 23 8 0
A 63 9 6
B 45 3 5

Awk – Match Values Between Two Files and Create a New File

Here's one way:

$ awk -F"[, ]" 'NR==FNR{a[$1]=$1","$2; next} ($2 in a){print a[$2]","$1}' file1 file2 
1000,Brian,3044
400,Nick,4466
1010,Jason,1206

The -F"[, ]" sets the field separator to either a space or a comma. NR is the line number counted across all input files, and FNR is the line number within the current file. The two are equal only while the 1st file is being read. Therefore, NR==FNR{a[$1]=$1","$2; next} runs only on the lines of the first file and saves the 1st and 2nd fields (with a comma in between) as values in the array a, keyed by the 1st field. Then, when the 2nd file is being read, if its 2nd field is a key in a, we print the value associated with it (the 1st and 2nd fields of the first file) followed by the 1st field of the second file.
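To see how the two counters behave, here is a small sketch that prints both for every line of two throwaway files. The file names first.txt and second.txt and their contents are made up for the demonstration.

```shell
# Two small hypothetical input files.
printf 'a\nb\n' > first.txt
printf 'c\nd\ne\n' > second.txt

# NR keeps counting across files; FNR restarts at 1 for each new file.
awk '{ print FILENAME, NR, FNR }' first.txt second.txt
# first.txt 1 1
# first.txt 2 2
# second.txt 3 1
# second.txt 4 2
# second.txt 5 3
```

NR and FNR agree only on the lines of first.txt, which is what makes NR==FNR a test for "still reading the first file". One caveat: if the first file is empty, the condition is also true for the second file.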

That said, there's actually an app for this! This sort of thing is what join was made for. Sadly, since your two files are unsorted and have different delimiters, we need some tricks. If your shell supports <(), you can do:

$ join -t, -1 1 -2 2 <(sort file1) <(sed 's/ /,/g' file2 | sort -t"," -k2) 
1000,Brian,3044
1010,Jason,1206
400,Nick,4466

The join -t, -1 1 -2 2 means use , as the delimiter and join on the 1st field of file1 and the 2nd field of file2. The sed just replaces spaces with commas so we have the same delimiter in both files. The sort does what it says on the bottle: it sorts its input.
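If your shell does not support <(), the same pipeline can be sketched with intermediate files. The input data below is a guess reconstructed from the output shown above (the original files are not included here), and the names names.csv, ids.txt, names.sorted and ids.sorted are hypothetical.

```shell
# Use one collation for both sort and join so they agree on ordering.
LC_ALL=C
export LC_ALL

# Assumed inputs: "id,name" in one file, "number id" in the other.
printf '1000,Brian\n400,Nick\n1010,Jason\n' > names.csv
printf '3044 1000\n4466 400\n1206 1010\n' > ids.txt

# Sort each file on the field it will be joined on.
sort -t, -k1,1 names.csv > names.sorted
sed 's/ /,/g' ids.txt | sort -t, -k2,2 > ids.sorted

# Join field 1 of the first file with field 2 of the second.
join -t, -1 1 -2 2 names.sorted ids.sorted
# 1000,Brian,3044
# 1010,Jason,1206
# 400,Nick,4466
```

join prints the join field first, then the remaining fields of the first file, then the remaining fields of the second, which is why the name comes before the number.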
