Bash – How to extract lines by words in specific position, not column

bash, columns, text processing

I have an input file like this:

                     v
ATOM     57  O   LYS A   7       2.254  25.484  18.942  1.00 14.46
ATOM     77  NH1AARG A   8       5.557  19.204  13.388  0.55 24.50
TER    1648      ILE C 206
HETATM 1668  O   HOH A1023      25.873  38.343   2.138  1.00 21.99
                     ^

I only need the lines that contain A at the marked position (character 22). In most lines, A is a single character in the fifth column, as in the first line. Sometimes, however, it ends up in the fourth column, as in the second line, or at the start of a longer string, as in the last one. Note that a single-character A can also appear at positions other than 22, but I only care about it when it is here.

I need my output to contain only the lines with A at that position, regardless of whether it is a single character or part of a string:

ATOM     57  O   LYS A   7       2.254  25.484  18.942  1.00 14.46
ATOM     77  NH1AARG A   8       5.557  19.204  13.388  0.55 24.50
HETATM 1668  O   HOH A1023      25.873  38.343   2.138  1.00 21.99

But sometimes I also want to extract only the lines with a single A, regardless of which column it is in:

ATOM     57  O   LYS A   7       2.254  25.484  18.942  1.00 14.46
ATOM     77  NH1AARG A   8       5.557  19.204  13.388  0.55 24.50

Best Answer

You can use

grep -E '^.{21}A' file

if you want to include cases like A1023, and

grep -E '^.{21}A\>' file

if you want only lines where A appears as an isolated character.

NOTE: In the second example, \> matches the empty string at the end of a word, so the A at position 22 must not be followed by a letter, digit, or underscore.
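As a quick check, the two commands can be run against the four sample lines from the question. This is only a sketch: the temporary file and the variable names (sample, any_a, isolated_a) are made up for illustration, and \> is a GNU extension that other grep implementations may not support.

```shell
# Recreate the four sample lines from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM     57  O   LYS A   7       2.254  25.484  18.942  1.00 14.46
ATOM     77  NH1AARG A   8       5.557  19.204  13.388  0.55 24.50
TER    1648      ILE C 206
HETATM 1668  O   HOH A1023      25.873  38.343   2.138  1.00 21.99
EOF

# Any line with A at character position 22 (21 arbitrary characters, then A).
any_a=$(grep -E '^.{21}A' "$sample")

# Only lines where that A is also the end of a word (GNU \> extension).
isolated_a=$(grep -E '^.{21}A\>' "$sample")

printf '%s\n' "$any_a"       # the two ATOM lines and the HETATM line
printf -- '--\n'
printf '%s\n' "$isolated_a"  # the two ATOM lines only

rm -f "$sample"
```

The TER line is dropped by both commands because it has C at position 22, and the HETATM line is dropped by the second one because A is followed by a digit.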

Excerpt from the grep man page:

The Backslash Character and Special Expressions

The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].
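The word-boundary symbols described in the excerpt can be compared on a few short strings. The three test strings below are made up for illustration, and the last pattern is one possible way to spell "A at the end of a word" with a bracket expression instead of \>, based on the \W definition above; treat it as a sketch, not as the man page's own recommendation.

```shell
# Three hypothetical test strings: isolated A, A starting a word, A ending a word.
lines=$(printf '%s\n' 'A 7' 'A1023' 'NH1A')

# A at the end of a word: matches 'A 7' and 'NH1A'.
end_of_word=$(printf '%s\n' "$lines" | grep -E 'A\>')

# A at the beginning of a word: matches 'A 7' and 'A1023'.
start_of_word=$(printf '%s\n' "$lines" | grep -E '\<A')

# A with a word edge on both sides: matches 'A 7' only.
whole_word=$(printf '%s\n' "$lines" | grep -E '\bA\b')

# Same as A\> without the GNU extension:
# A followed by a non-word character or by the end of the line.
portable=$(printf '%s\n' "$lines" | grep -E 'A([^_[:alnum:]]|$)')

printf 'end of word:   %s\n' "$(printf '%s' "$end_of_word" | tr '\n' ',')"
printf 'start of word: %s\n' "$(printf '%s' "$start_of_word" | tr '\n' ',')"
printf 'whole word:    %s\n' "$whole_word"
printf 'portable:      %s\n' "$(printf '%s' "$portable" | tr '\n' ',')"
```

Applied to the answer above, the same idea would give grep -E '^.{21}A([^_[:alnum:]]|$)' file as an alternative to the \> form.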

Related Solutions

Lum – Fill empty lines in specific column with values

If your data is expressed in fixed width columns, you could do:

For the first case:

sed 's/^.\{4\}$/& -9/'

(add " -9" to lines of 4 characters).

For the second case:

sed -e '/.\{11\}/b' -e 's/$/          /;s/\(.\{10\}\).*/\1-9/'

(add up to 10 spaces and -9 to lines of less than 11 characters).

Generally, to parse lines with fixed width fields, see the FIELDWIDTHS special variable of GNU awk.
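The two sed commands above can be tried on a few made-up lines. The input data here is hypothetical (the original question's data is not shown), chosen so that the value column starts at character 6 in the first case and at character 11 in the second.

```shell
# Case 1: lines that are exactly 4 characters long get " -9" appended.
case1=$(printf '%s\n' '1001 42' '1002' '1003 17' |
  sed 's/^.\{4\}$/& -9/')

# Case 2: lines of 11 or more characters are left alone (the b command
# skips the rest of the script); shorter lines are padded with spaces,
# cut back to 10 characters, and then get -9 appended.
case2=$(printf '%s\n' 'alpha     12' 'beta' 'gamma     7' |
  sed -e '/.\{11\}/b' -e 's/$/          /;s/\(.\{10\}\).*/\1-9/')

printf '%s\n' "$case1"
printf -- '--\n'
printf '%s\n' "$case2"
```

In both cases only the short middle line changes: it becomes "1002 -9" in the first case and "beta" padded to 10 characters followed by -9 in the second.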

Text Processing – Adding a String to Every Column Except the First Using awk or sed

Just iterate over all fields starting with the second, and concatenate the first field to whatever you already have:

$ awk '{ for(i=2;i<=NF;i++){ $i = $1$i }}1' file
Name1 Name1String111 Name1String112
Name2 Name2String121 Name2String122 Name2String123
Name3 Name3String131 Name3String132 Name3String133 Name3String134

The 1 at the end is awk shorthand for "print the current line". You could write the same thing like this:

$ awk '{ for(i=2;i<=NF;i++){ $i = $1$i }; print}' file
Name1 Name1String111 Name1String112
Name2 Name2String121 Name2String122 Name2String123
Name3 Name3String131 Name3String132 Name3String133 Name3String134

The basic idea above can be trivially expanded to match all of your examples. NF is the special awk variable that holds the number of fields; it will always be set to however many fields are present in the current line. Then, awk allows you to refer to specific fields using a variable. So if you set i=5, then $i is equivalent to $5. This then lets you iterate over all fields using the for(i=2;i<=NF;i++) { } format which sets i to all numbers from 2 to the number of fields on this line.
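The loop described above can be run as a self-contained sketch. The input lines are an assumption, reconstructed from the output shown earlier rather than taken from the original question, and the variable names (input, field_counts, result) are made up for illustration.

```shell
# Assumed input: a name followed by a varying number of strings.
input='Name1 String111 String112
Name2 String121 String122 String123'

# NF differs per line: 3 fields on the first line, 4 on the second.
field_counts=$(printf '%s\n' "$input" | awk '{ print NF }')

# Prefix every field from the second to the last (NF) with the first field.
result=$(printf '%s\n' "$input" | awk '{
  for (i = 2; i <= NF; i++) {
    $i = $1 $i   # $i refers to field number i
  }
  print
}')

printf '%s\n' "$field_counts"
printf '%s\n' "$result"
```

Because the loop bound is NF, not a fixed number, the same program handles lines with any number of fields.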
