AWK and Perl – Extract Specific Characters from Each Line

awk, perl, text processing

I have a text file, and I want to extract the string that follows "OS=" on each line.

Input file lines:
A0A0A9PBI3_ARUDO Uncharacterized protein OS=Arundo donax OX=35708 PE=4 SV=1
K3Y356_SETIT ATP-dependent DNA helicase OS=Setaria italica OX=4555 PE=3 SV=1

Desired output:

OS=Arundo donax
OS=Setaria italica

or, without the "OS=" prefix:

Arundo donax
Setaria italica

Best Answer

Use GNU grep (or compatible) with extended regex:

grep -Eo "OS=\w+ \w+" file

or with basic regex (where you need to escape the +):

grep -o "OS=\w\+ \w\+" file
# or
grep -o "OS=\w* \w*" file

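To get everything from OS= up to OX=, you can use grep with Perl-compatible regular expressions (PCRE, the -P option), if available, and a lookahead. The space inside the lookahead keeps the trailing space before OX= out of the match:

grep -Po "OS=.*(?= OX=)" file

# to also leave out "OS=",
# use a lookbehind
grep -Po "(?<=OS=).*(?= OX=)" file
# or \K, which drops everything matched so far from the output
grep -Po "OS=\K.*(?= OX=)" file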

Or have grep match up to and including OX=, then remove it with sed:

grep -o "OS=.*\( OX=\)" file | sed 's/ OX=$//'

Output:

OS=Arundo donax
OS=Setaria italica
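
Since the question asks about awk and Perl, here is a minimal sketch of the same extraction with awk, perl and sed. It assumes, as in the sample lines, that every record has " OX=" somewhere after "OS="; the file name proteins.txt is made up for the demonstration. Each command prints the names without the "OS=" prefix.

```shell
# Sample input copied from the question; the file name proteins.txt is hypothetical.
cat > proteins.txt <<'EOF'
A0A0A9PBI3_ARUDO Uncharacterized protein OS=Arundo donax OX=35708 PE=4 SV=1
K3Y356_SETIT ATP-dependent DNA helicase OS=Setaria italica OX=4555 PE=3 SV=1
EOF

# awk: match() sets RSTART and RLENGTH for the span "OS=... OX=".
# Start 3 characters in (past "OS=") and shorten by 7 ("OS=" plus " OX=").
awk 'match($0, /OS=.* OX=/) { print substr($0, RSTART + 3, RLENGTH - 7) }' proteins.txt

# perl: lazy capture between "OS=" and " OX=", printed only when the line matches.
perl -ne 'print "$1\n" if /OS=(.*?) OX=/' proteins.txt

# sed: basic-regex capture group; -n with the p flag prints matching lines only.
sed -n 's/.*OS=\(.*\) OX=.*/\1/p' proteins.txt
```

Lines that lack either marker are silently skipped by all three commands, which may or may not be what you want.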

Related Solutions

Unix – Get Characters 10 to 80 in a File

I wonder how the line feed in the file should be handled. Does that count as a character or not?

If we should simply start at byte 10 and print 71 bytes (A, C, T, G and line feed), then Sato Katsura's solution is the fastest. This assumes GNU dd or a compatible implementation for status=none; with other implementations, replace it with 2> /dev/null, though that would also hide any error messages:

 dd if=file bs=1 count=71 skip=9 status=none

If line feeds should be skipped, filter them out with tr -d '\n':

 tr -d '\n' < file | dd bs=1 count=70 skip=9 status=none

If the FASTA header should also be skipped, use:

 grep -v '^[;>]' file | tr -d '\n' | dd bs=1 count=70 skip=9 status=none

Here grep -v '^[;>]' file skips all lines that start with ; or >.
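
As a hedged alternative to dd, the same pipeline can end in cut, whose -c range is inclusive at both ends, so 10-80 yields 71 characters. This is only a sketch: the file sample.fa and its contents are invented for the demonstration, and it assumes plain ASCII sequence data.

```shell
# Hypothetical sample: one FASTA header, then 60 A's and 60 C's on separate lines.
awk 'BEGIN {
  print ">seq1 demo"
  for (i = 1; i <= 120; i++) {
    printf "%s", (i <= 60 ? "A" : "C")
    if (i % 60 == 0) print ""
  }
}' > sample.fa

# Drop header lines, join the sequence lines, keep characters 10 through 80.
grep -v '^[;>]' sample.fa | tr -d '\n' | cut -c10-80
```

With this sample the result is the last 51 A's followed by the first 20 C's. Unlike dd with bs=1, cut reads its input in larger chunks, but it has to scan the whole joined line rather than stopping after the requested bytes.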

Awk print from nth column to last

awk -v n=4 '{ for (i=n; i<=NF; i++) printf "%s%s", $i, (i<NF ? OFS : ORS)}' input

This sets the awk variable n to 4 and loops from field n through the last field, NF. On each iteration it prints the current field; if that is not the last field on the line, it prints OFS after it (a space by default), and if it is the last field, it prints ORS after it (a newline by default).

$ echo 'vddp vddpi vss cb0 cb1 cb2 cb3 ct0 ct1 ct2 ct3' |
> awk -v n=4 '{ for (i=n; i<=NF; i++) printf "%s%s", $i, (i<NF ? OFS : ORS)}'
cb0 cb1 cb2 cb3 ct0 ct1 ct2 ct3
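
Because the loop prints OFS between fields, the same command also works for other delimiters once the input and output separators are set. A small sketch with made-up comma-separated input, where -F, sets the input field separator and -v OFS=, sets the output one:

```shell
# Hypothetical comma-separated input; print from the 3rd field to the last.
echo 'id,name,x,y,z' |
  awk -F, -v OFS=, -v n=3 '{ for (i=n; i<=NF; i++) printf "%s%s", $i, (i<NF ? OFS : ORS) }'
# prints: x,y,z
```

Without -v OFS=, the fields would be split on commas but joined with spaces.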
