Extract Strings from First Column of a File – Text Processing Guide

awk, grep, sed, text processing

I want to extract from masterFile.list the lines that match a list of numbers stored in string.txt. masterFile.list is separated by | and contains more than one column. I am only interested in the lines whose first column contains a number listed in string.txt.

string.txt:

3075
3078
3076

masterFile.list:

3078    |       Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015      |                   |       authority       |
3079    |       Auxenochlorella pyrenoidosa 3078    |               |       scientific name |
3076    |       Chlorella pyrenoidosa H.Chick, 1903     |               |       authority       |
3077    |       Chlorella vulgaris var. viridis Chodat, 1913    |               |       authority
487     |       ATCC 13077      |       ATCC 13077 <type strain>        |       type material   |
460     |       DSM 23076       |       DSM 23076 <type strain> |       type material   |

expected output:

3078    |       Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015      |                       |       authority       |
3076    |       Chlorella pyrenoidosa H.Chick, 1903     |               |       authority       |

Most of the previous posts I have found only cover extracting a single string while limiting the match to the first column. Is it possible to extract more than one string at a time?

Best Answer

You can use the following awk program:

awk -F' *[|]' 'NR==FNR{searchstr[$1]=1} NR>FNR && ($1 in searchstr) {print}' string.txt masterFile.list

As you can see, you provide both files as arguments to awk.

While the first file is processed (indicated by FNR, the per-file line counter, being equal to NR, the global line counter), we simply register every search string (field 1 of each line, since it is the only item on the line) as an index of the array searchstr. The stored value of 1 is just a dummy.
When we come to the second file (NR is now greater than FNR), we check whether the first column ($1) exists as an index in searchstr. If so, we print the entire line.
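
To try the two-file idiom described above, here is a minimal sketch that runs it on a subset of the sample lines from the question. It assumes a POSIX shell with mktemp available; the scratch directory and the variable names tmpdir and result are illustrative only. The separator is written as ' *[|]' so that the | is taken literally.

```shell
# Hypothetical scratch directory; the file names follow the question.
tmpdir=$(mktemp -d)

# The list of numbers to look for, one per line.
printf '%s\n' 3075 3078 3076 > "$tmpdir/string.txt"

# A subset of the sample lines from the question.
cat > "$tmpdir/masterFile.list" <<'EOF'
3078    |       Auxenochlorella pyrenoidosa (H.Chick) Molinari & Calvo-Perez, 2015      |                   |       authority       |
3079    |       Auxenochlorella pyrenoidosa 3078    |               |       scientific name |
3076    |       Chlorella pyrenoidosa H.Chick, 1903     |               |       authority       |
460     |       DSM 23076       |       DSM 23076 <type strain> |       type material   |
EOF

# First file: remember each number as an array index.
# Second file: print a line only if its first field is one of those indices.
result=$(awk -F' *[|]' '
    NR == FNR        { searchstr[$1] = 1; next }
    $1 in searchstr  { print }
' "$tmpdir/string.txt" "$tmpdir/masterFile.list")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Only the lines starting with 3078 and 3076 are printed. The 3079 line is skipped even though 3078 appears in its second column, and the DSM 23076 line is skipped for the same reason.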

The idea behind this is that awk has a convenient syntax, string in array, which is true if string is among the indices of array.

As noted by Ed Morton, you can "golf" this into

awk -F' *[|]' 'NR==FNR{searchstr[$1]; next} $1 in searchstr' string.txt masterFile.list

The bare reference searchstr[$1] defines that array entry without filling it, and the $1 in searchstr condition outside of the rule block will, if it evaluates to true, instruct awk to print the current line. The next instruction in the rule for processing string.txt ensures that this part is only reached for masterFile.list.
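
The behaviour of a bare array reference and of the in operator can be checked in isolation. The following is a small sketch with hard-coded keys; the variable name out is only for illustration.

```shell
out=$(awk 'BEGIN {
    searchstr["3078"]    # bare reference: creates the index, the value stays empty
    if ("3078" in searchstr)    print "3078 present"
    if (!("3079" in searchstr)) print "3079 absent"
    print "stored value: [" searchstr["3078"] "]"
}')
printf '%s\n' "$out"
```

This prints "3078 present", "3079 absent" and "stored value: []". Testing a key with in does not create an entry for it, which is why the idiom is safe to use as a filter.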

Note that I specified a full regular expression ( *[|], i.e. any number of spaces followed by a literal |) as the field separator to ensure that the "first field" of masterFile.list really is only the number. The | is written as the bracket expression [|] because a bare | means alternation inside a regular expression. Specifying -F'|' would have meant that the trailing spaces are included in the field, which would have made the matching more involved. If the "spaces" can also contain TABs, use -F'[[:space:]]*[|]' instead.
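
The effect of the separator on the first field can be seen by printing it between brackets. This is a sketch on one shortened sample line; the variable names line, plain and regex are illustrative.

```shell
line='3078    |       Auxenochlorella pyrenoidosa |       authority       |'

# Single-character separator: the spaces before | stay in the first field.
plain=$(printf '%s\n' "$line" | awk -F'|' '{ print "[" $1 "]" }')

# Regular-expression separator: spaces before | are part of the separator.
regex=$(printf '%s\n' "$line" | awk -F' *[|]' '{ print "[" $1 "]" }')

printf '%s\n%s\n' "$plain" "$regex"
```

The first command prints [3078    ] with the trailing spaces, the second prints [3078], which is the form that matches the entries of string.txt.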

Related Solutions

Deleting extension only from the first column

awk solution:

awk -F'\t' '{sub(/\..+$/,"",$1)}1' OFS='\t' file

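-F'\t' - sets the tab character as the field separator
sub(/\..+$/,"",$1) - removes the first . and all characters following it from the 1st field
OFS='\t' - keeps tab as the output separator when awk rebuilds the modified line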

The output:

ENSG00000242268 0.07563
ENSG00000270112 0.09976
ENSG00000167578 4.38608
ENSG00000273842 0.0
ENSG00000078237 4.08856

Or with a simple sed approach:

sed 's/\.[0-9]*//' file
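
The two commands are not fully equivalent: awk only touches the first field, while sed removes the first dot followed by digits anywhere on the line. The sketch below compares them. The input lines are assumptions, since the original input file is not shown; in particular the ".4" suffix is hypothetical. OFS is passed with -v here because the input comes from a pipe.

```shell
tab=$(printf '\t')
with_ext="ENSG00000242268.4${tab}0.07563"   # hypothetical line with an extension
no_ext="ENSG00000273842${tab}0.0"           # hypothetical line without one

# Line with an extension in column 1: both approaches give the same result.
awk_a=$(printf '%s\n' "$with_ext" | awk -F'\t' -v OFS='\t' '{sub(/\..+$/,"",$1)}1')
sed_a=$(printf '%s\n' "$with_ext" | sed 's/\.[0-9]*//')

# Line without an extension: sed falls through to the dot in column 2.
awk_b=$(printf '%s\n' "$no_ext" | awk -F'\t' -v OFS='\t' '{sub(/\..+$/,"",$1)}1')
sed_b=$(printf '%s\n' "$no_ext" | sed 's/\.[0-9]*//')

printf '%s\n' "$awk_a" "$sed_a" "$awk_b" "$sed_b"
```

For the second line awk leaves 0.0 intact, whereas sed turns it into 0. The sed variant is therefore only safe when every first column is known to contain a dot.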

Find only certain strings (domain) extract another file

This gets the result you show in your example.

grep '^[^/]*/[^/]*/[^/]*/$' findmydomain.txt >new

These are not properly "domain names", they are URLs possibly with one or more subdomains. For example, in www.google.com, the domain name is google.com and www is just an individual node name. In the general case, resolving the TLD out of a hostname is a much more complex problem which requires knowledge of each individual TLD.

The final slash is optional, strictly speaking; @terdon's answer uses a more complex regex which solves this. As a quick and dirty fix, you could add a * after the final slash here (which would however then also match http://example.com/// with an arbitrary amount of redundant trailing slashes). The regex looks for lines with exactly three slashes in them, with optional non-slash characters before and between them.
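
The difference between the strict pattern and the quick fix with a trailing * can be checked on a few made-up URLs. The example.com lines below are assumptions and are not the contents of findmydomain.txt.

```shell
urls=$(printf '%s\n' \
    'http://example.com/' \
    'http://example.com' \
    'http://example.com/path/' \
    'http://example.com///')

# Exactly three slashes, the last one at the end of the line.
strict=$(printf '%s\n' "$urls" | grep '^[^/]*/[^/]*/[^/]*/$')

# Final slash made optional with *, which also admits repeated trailing slashes.
loose=$(printf '%s\n' "$urls" | grep '^[^/]*/[^/]*/[^/]*/*$')

printf 'strict:\n%s\nloose:\n%s\n' "$strict" "$loose"
```

The strict pattern keeps only http://example.com/. The loose pattern also keeps http://example.com and http://example.com///, while http://example.com/path/ is rejected by both.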
