I want to extract the lines of masterFile.list that correspond to a list of numbers (string.txt). masterFile.list is separated by | and contains more than one column. I am only interested in the lines whose first column contains a number listed in string.txt.
string.txt:
3075
3078
3076
masterFile.list:
3078 | Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015 | | authority |
3079 | Auxenochlorella pyrenoidosa 3078 | | scientific name |
3076 | Chlorella pyrenoidosa H.Chick, 1903 | | authority |
3077 | Chlorella vulgaris var. viridis Chodat, 1913 | | authority
487 | ATCC 13077 | ATCC 13077 <type strain> | type material |
460 | DSM 23076 | DSM 23076 <type strain> | type material |
expected output:
3078 | Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015 | | authority |
3076 | Chlorella pyrenoidosa H.Chick, 1903 | | authority |
Most of the previous posts I have found only allow extracting a single string and limit the match to the first column. Is it possible to extract more than one string at a time?
Best Answer
You can use an awk program for this; both files are provided as arguments to awk.

While the first file is processed (indicated by FNR, the per-file line counter, being equal to NR, the global line counter), we simply register all search strings (field no. 1 of each line, since they are the only items) in an array searchstr (in the form of array indices, so the "value" is just a dummy value of 1).

When we come to the second file (NR is now greater than FNR), we check whether the first column ($1) is contained as an array index in searchstr. If so, we print the entire line.

The idea behind this is that awk has a convenient syntax, string in array, which is true if string is among the array indices of array.

As noted by Ed Morton, you can "golf" this into a shorter form.
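A sketch of both forms as described above: the explicit two-rule program and the golfed variant. The exact commands are a reconstruction, so the [|] spelling of the field separator, the scratch directory, and the shell variable names explicit and golfed are assumptions; the sample data is copied from the question.

```shell
# Work in a scratch directory so no existing files are touched (assumption).
cd "$(mktemp -d)"

# Sample inputs copied from the question.
cat > string.txt <<'EOF'
3075
3078
3076
EOF

cat > masterFile.list <<'EOF'
3078 | Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015 | | authority |
3079 | Auxenochlorella pyrenoidosa 3078 | | scientific name |
3076 | Chlorella pyrenoidosa H.Chick, 1903 | | authority |
3077 | Chlorella vulgaris var. viridis Chodat, 1913 | | authority
487 | ATCC 13077 | ATCC 13077 <type strain> | type material |
460 | DSM 23076 | DSM 23076 <type strain> | type material |
EOF

# Explicit form: the first rule runs only for string.txt (NR == FNR) and
# stores each number as an array index; the second rule runs only for
# masterFile.list (NR > FNR) and prints lines whose first field is an index.
explicit=$(awk -F' *[|]' '
    NR == FNR { searchstr[$1] = 1 }
    NR > FNR  { if ($1 in searchstr) print }
' string.txt masterFile.list)

# Golfed form: a bare searchstr[$1] creates the index, "next" skips the
# remaining rule for string.txt, and a pattern without an action prints.
golfed=$(awk -F' *[|]' 'NR == FNR { searchstr[$1]; next } $1 in searchstr' \
    string.txt masterFile.list)

printf '%s\n' "$explicit"
# 3078 | Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015 | | authority |
# 3076 | Chlorella pyrenoidosa H.Chick, 1903 | | authority |
```

Both variants print the same two lines. The lines for 487 and 460 are skipped even though they contain 3077 and 3076 as substrings, because only the whole first field is looked up.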
The searchstr[$1] call will define (but not fill) that array entry, and the $1 in searchstr outside of the rule block will, if it evaluates to true, instruct awk to print the current line. The next instruction in the rule for processing string.txt will ensure that this part is only reached for masterFile.list.
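A minimal sketch of why the bare array reference is enough: merely mentioning an element creates its index, and the in operator only tests for the index without looking at the value. The array name seen and the variables hit, miss, and result are hypothetical and used only for this illustration.

```shell
result=$(awk 'BEGIN {
    seen["3078"]               # bare reference: creates the index, value stays empty
    hit  = ("3078" in seen)    # 1, the index exists
    miss = ("3077" in seen)    # 0, and the test itself does not create the index
    print hit, miss
}')
printf '%s\n' "$result"
# 1 0
```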
Note that I specified a full regular expression ( *\|, i.e. any number of spaces followed by a literal |; the | has to be escaped or written as [|], because a bare | means alternation in a regular expression) as field separator in order to ensure that the "first field" of masterFile.list really is only the number. Specifying -F'|' would have meant that the trailing space is included, too, and would have made the matching process more involved. If the "spaces" can actually also contain TABs, use -F'[[:space:]]*[|]' instead.
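To illustrate the difference between the separators, this sketch prints the first field in brackets for each choice. The shortened sample line and the variable names line, plain, regex, and tabs are assumptions made for the example, and the [[:space:]] class requires an awk that supports POSIX character classes.

```shell
line='3078 | Auxenochlorella pyrenoidosa | | authority |'

# Single-character separator: the space before the | stays in $1.
plain=$(printf '%s\n' "$line" | awk -F'|' '{ print "[" $1 "]" }')

# Regular-expression separator: spaces before the | belong to the separator.
regex=$(printf '%s\n' "$line" | awk -F' *[|]' '{ print "[" $1 "]" }')

# Same idea when the padding is a TAB followed by a space.
tabs=$(printf '3078\t | name |\n' | awk -F'[[:space:]]*[|]' '{ print "[" $1 "]" }')

printf '%s\n' "$plain" "$regex" "$tabs"
# [3078 ]
# [3078]
# [3078]
```

With -F'|' the lookup key would be "3078 " (with a trailing space), which never matches the bare numbers read from string.txt.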