Shell – How to extract the first integer from text string in a column of a tab-delimited file

awkcolumnssedshell-scripttext processing

I work in Medical Genetics and often have delimited text files where in one column (ex. column 5) there is a text string with a "mutation" in our jargon:
c.2458C>T or
c.45_46delAA or
c.749_754delinsTG

Similarly, in another file it might read:
p.Glu34* or p.Ala78_Arg80del or p.L378Ffs*11

The c. and p. are supposed to be there but might be omitted. There could be any number of non-numeric characters. The numbers are always integers and usually 1-14 or so digits long.

I want to add a new column somewhere in my file, which has only the first integer, like 2458 or 45 or 749 in the first example. Then I want to use this integer as a key value for looking up several values in a lookup table.

Some of my files have 70,000 lines so manual editing is not possible…

The more basic the solution the better. Can it be done with bash, sed, or awk?

An example table would be (as interpreted correctly below):

1       2       3       4       c.2458C>T
a   b   c   d   c.45_46delAA
a1  b2  c3  d4  p.Ala78_Arg80del

(Note: the columns are tab-delimited, not space-delimited)

There is a specification to this format by the Human Genome Variation Society. No program uses this format (I hope!) but people use it in publications and medical reports. Newer formats, like the Variant Call Format have been introduced, which are far more parsable.

Best Answer

Based on your description, suppose that we have, as input, a tab-separated file like:

$ cat file
1       2       3       4       c.2458C>T       6
a       b       c       d       c.45_46delAA or f
a1      b2      c3      d4      p.Ala78_Arg80del        f6

Using sed

To find the first integer from the fifth column:

$ sed -r 's/([^\t]*\t){4}[^[:digit:]]*([[:digit:]]+).*/\2/' file
2458
45
78

The above was tested on GNU sed. For OSX or other BSD system, try:

sed -E 's/([^\t]*\t){4}[^[:digit:]]*([[:digit:]][[:digit:]]*).*/\2/' file

Using awk

$ awk '{sub(/^[^[:digit:]]*/, "", $5); sub(/[^[:digit:]].*/, "", $5); print $5;}' file
2458
45
78
Related Question