Extract Strings from First Column of a File – Text Processing Guide

awk, grep, sed, text processing

I want to extract from masterFile.list the lines that match a list of numbers stored in string.txt. masterFile.list is delimited by | and contains more than one column. I am only interested in the lines whose first column contains a number listed in string.txt.

string.txt:

3075
3078
3076

masterFile.list:

3078    |       Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015      |                   |       authority       |
3079    |       Auxenochlorella pyrenoidosa 3078    |               |       scientific name |
3076    |       Chlorella pyrenoidosa H.Chick, 1903     |               |       authority       |
3077    |       Chlorella vulgaris var. viridis Chodat, 1913    |               |       authority
487     |       ATCC 13077      |       ATCC 13077 <type strain>        |       type material   |
460     |       DSM 23076       |       DSM 23076 <type strain> |       type material   |

expected output:

3078    |       Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015      |                       |       authority       |
3076    |       Chlorella pyrenoidosa H.Chick, 1903     |               |       authority       |

Most of the previous posts I have found only show how to extract a single string with the match limited to the first column. Is it possible to extract more than one string at a time?

Best Answer

You can use the following awk program:

awk -F' *[|]' 'NR==FNR{searchstr[$1]=1} NR>FNR && ($1 in searchstr) {print}' string.txt masterFile.list

As you can see, you provide both files as arguments to awk.

  • While the first file is processed (indicated by FNR, the per-file line-counter, being equal to NR, the global line counter), we simply register all search strings (field nr. 1 of each line, since they are the only items) in an array searchstr (however, in form of an array index, so the "value" is just a dummy value of 1).

  • When we come to the second file (NR is now greater than FNR), we check if the first column ($1) is contained as an array index in searchstr. If so, we print the entire line.
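How the two counters behave can be seen in a minimal sketch. The files first.txt and second.txt are hypothetical scratch files, not part of the question. One caveat worth keeping in mind: if the first file were empty, NR==FNR would also hold while reading the second file.

```shell
# Hypothetical scratch files, used only to show how NR and FNR evolve.
tmpdir=$(mktemp -d)
printf '%s\n' a b   > "$tmpdir/first.txt"
printf '%s\n' c d e > "$tmpdir/second.txt"

# NR keeps counting across files, FNR restarts at 1 for each new file.
counters=$(awk '{ print NR, FNR, (NR == FNR ? "first file" : "later file") }' \
    "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$counters"
rm -rf "$tmpdir"
```

The two counters agree only on the lines of first.txt, which is why NR==FNR can serve as a "still reading the first file" test.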

The idea behind this is that awk has a convenient expression, "string in array", which is true if string is one of the indices of array.
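To try the lookup end to end, the following sketch recreates the sample files from the question in a scratch directory (the directory and the variable name matched are assumptions made for illustration). It prints only the first field of each matching line rather than the whole line, purely to keep the result compact, and it writes the field separator as ' *[|]' so that the | is taken literally.

```shell
# Recreate the sample data from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 3075 3078 3076 > "$tmpdir/string.txt"
cat > "$tmpdir/masterFile.list" <<'EOF'
3078    |       Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015      |                   |       authority       |
3079    |       Auxenochlorella pyrenoidosa 3078    |               |       scientific name |
3076    |       Chlorella pyrenoidosa H.Chick, 1903     |               |       authority       |
3077    |       Chlorella vulgaris var. viridis Chodat, 1913    |               |       authority
487     |       ATCC 13077      |       ATCC 13077 <type strain>        |       type material   |
460     |       DSM 23076       |       DSM 23076 <type strain> |       type material   |
EOF

# First file: store each ID as an array index.
# Second file: print the first field when it is one of the stored indices.
matched=$(awk -F' *[|]' 'NR==FNR{searchstr[$1]=1} NR>FNR && ($1 in searchstr) {print $1}' \
    "$tmpdir/string.txt" "$tmpdir/masterFile.list")
printf '%s\n' "$matched"
rm -rf "$tmpdir"
```

Lines such as "Auxenochlorella pyrenoidosa 3078" or "DSM 23076" are not selected, because only the first field is compared and the comparison is an exact index lookup rather than a substring search.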

As noted by Ed Morton, you can "golf" this into

awk -F' *[|]' 'NR==FNR{searchstr[$1]; next} $1 in searchstr' string.txt masterFile.list

The bare reference searchstr[$1] creates that array entry without assigning a value to it, and the pattern $1 in searchstr, written without an action block, instructs awk to print the current line whenever it evaluates to true. The next statement in the rule for string.txt ensures that this pattern is only reached for masterFile.list.
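That a bare reference is enough to create an index can be checked with a small sketch (the array name seen and the input values are chosen here for illustration only):

```shell
# Referencing seen[$1] without assigning anything still creates the index.
result=$(printf '%s\n' 3075 3078 | awk '
    { seen[$1] }
    END {
        present = ("3078" in seen)   # index was created by the bare reference
        absent  = ("9999" in seen)   # never referenced, so not an index
        print present, absent, length(seen["3075"])
    }')
printf '%s\n' "$result"
```

The third value is the length of the stored string, which is 0 because the entry exists but holds an empty value.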

Note that I specified a full regular expression ( *[|], i.e. any number of spaces followed by a literal |) as the field separator to ensure that the "first field" of masterFile.list really is only the number. The | is placed in a bracket expression because a bare | inside a regular expression is the alternation operator in awk. Specifying -F'|' would have meant that the trailing spaces are included too, which would have made the matching more involved. If the "spaces" can also contain TABs, use -F'[[:space:]]*[|]' instead.
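The effect of the separator on the first field can be compared directly. This is a sketch on a single made-up line, built with printf so that the number is padded to exactly eight characters before the |:

```shell
# Build one sample line: "3078" padded with spaces to width 8, then "|".
line=$(printf '%-8s|%s' 3078 '       authority       |')

# With a plain | separator the padding stays part of $1.
plain_len=$(printf '%s\n' "$line" | awk -F'|' '{ print length($1) }')

# With " *[|]" the spaces before the | belong to the separator.
regex_len=$(printf '%s\n' "$line" | awk -F' *[|]' '{ print length($1) }')

printf '%s %s\n' "$plain_len" "$regex_len"
```

With the plain separator the first field is eight characters long, so it would never equal the four-character key read from string.txt.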
