AWK and Perl – Extract Specific Characters from Each Line

awk, perl, text processing

I have a text file, and I want to extract the string that comes after "OS=" on each line.

Input file lines
A0A0A9PBI3_ARUDO Uncharacterized protein OS=Arundo donax OX=35708 PE=4 SV=1
K3Y356_SETIT ATP-dependent DNA helicase OS=Setaria italica OX=4555 PE=3 SV=1

Output desired

OS=Arundo donax
OS=Setaria italica

OR

Arundo donax
Setaria italica

Best Answer

Use GNU grep (or compatible) with extended regex:

grep -Eo "OS=\w+ \w+" file

or with basic regex, where you need to escape the +:

grep -o "OS=\w\+ \w\+" file
# or
grep -o "OS=\w* \w*" file
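The \w shorthand is not part of POSIX regular expressions, so it may be missing in some grep implementations. A minimal sketch of the same two-word match using a POSIX character class instead, assuming a grep that supports -o (itself a common extension); the temporary file and the variable names are only for illustration:

```shell
# Recreate the two sample lines from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
A0A0A9PBI3_ARUDO Uncharacterized protein OS=Arundo donax OX=35708 PE=4 SV=1
K3Y356_SETIT ATP-dependent DNA helicase OS=Setaria italica OX=4555 PE=3 SV=1
EOF

# Same idea as "OS=\w+ \w+": [[:alnum:]_] stands in for \w.
two_words=$(grep -Eo 'OS=[[:alnum:]_]+ [[:alnum:]_]+' "$sample")
printf '%s\n' "$two_words"

rm -f "$sample"
```

Like the \w version, this only captures exactly two words after OS=, so it would truncate names that have more than two words.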

To get everything from OS= up to OX=, you can use grep with Perl-compatible regex (PCRE, the -P option), if available, and a lookahead. Including the space in the lookahead keeps the trailing space out of the match:

grep -Po "OS=.*(?= OX=)" file

# to also leave out "OS=", use a lookbehind
grep -Po "(?<=OS=).*(?= OX=)" file
# or use \K, which drops everything matched so far from the output
grep -Po "OS=\K.*(?= OX=)" file

Or, without PCRE, match through OX= with grep and strip it with sed afterwards:

grep -o "OS=.* OX=" file | sed 's/ OX=$//'
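The same extraction can probably be done in a single sed call. A minimal sketch, assuming each line has at most one OS= and one OX= field as in the sample; extract_os_sed is a hypothetical helper name:

```shell
# Hypothetical helper: for each line of the file given as $1,
# print the text from "OS=" up to (not including) " OX=".
extract_os_sed() {
    # -n suppresses default printing; the trailing p prints only
    # lines where the substitution actually matched.
    sed -n 's/.*\(OS=.*\) OX=.*/\1/p' "$1"
}
```

Call it as extract_os_sed file. Lines that lack either field are skipped rather than printed unchanged.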

Output:

OS=Arundo donax
OS=Setaria italica
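
Since the question asks about awk and perl, here is a rough sketch of equivalents in both. It assumes the OS= value always ends at the first " OX=", as in the sample lines; the function names are hypothetical:

```shell
# Hypothetical helpers; each reads the file given as $1.
extract_os_awk() {
    # Find "OS=", then cut the remainder at the first " OX=".
    awk '{
        start = index($0, "OS=")
        if (start == 0) next              # skip lines without OS=
        rest = substr($0, start)
        stop = index(rest, " OX=")
        if (stop > 0) rest = substr(rest, 1, stop - 1)
        print rest
    }' "$1"
}

extract_os_perl() {
    # Non-greedy capture up to the first " OX="; only matching lines print.
    perl -ne 'print "$1\n" if /(OS=.*?) OX=/' "$1"
}
```

Call either one as extract_os_awk file or extract_os_perl file. To drop the "OS=" prefix in the perl version, move it outside the capture group: /OS=(.*?) OX=/.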