Extract number of length n from field and return string

bioinformatics, grep, text processing

I have a tab-delimited file with a combination of text and numbers. I want to keep each line as is, except that the 5th column should keep only its six-digit numbers, if any are present. For example:

gene1   NM_033629   598G>A  P912    syndrome  1, 192315 syndrome 2,  225750 syndrome 3 610448   score   AD  hom user    123456  Source
gene2   NM_000459   613G>A  V115I   syndrome 1 600195   score   AD  rec user    234567  Source

(Syndrome # is used as an example, this can be any text so not a pattern I can search and remove)

I want the output to be:

gene1   NM_033629   598G>A  P912    192315 225750 610448    score   AD  hom user    123456  Source
gene2   NM_000459   613G>A  V115I   600195  score   AD  rec user    234567  Source

I have four ways to extract the six-digit numbers; however, I cannot:

a. output each number on the line it originated from

b. successfully print the entire line with the one edited field.

The options I have used to extract the digits are:

cat inputfile | cut -f 5 |grep -P '(?<!\d)\d{6}(?!\d)'
cat inputfile | cut -f 5 |grep -Po '(?<!\d)\d{6}(?!\d)'
cat inputfile | cut -f 5 |grep -o '[[:digit:]]*'
cat inputfile | cut -f 5 |grep -o "[0-9]\{6\}"

I know using cut for the column is incorrect, but I wanted to make sure the extraction itself was correct, as there is also a six-digit number in field 9. I'm stuck on putting this all together. Thanks in advance for any suggestions.

Best Answer

If I understand correctly, you want the 5th column to become the space-separated concatenation of all the 6-digit numbers in it.

Maybe:

perl -F'\t' -lape '
   $F[4] = join " ", grep {length == 6} ($F[4] =~ /\d+/g);
   $_ = join "\t", @F' < file

Or, reusing your negative look-around operators:

perl -F'\t' -lape '
   $F[4] = join " ", ($F[4] =~ /(?<!\d)\d{6}(?!\d)/g);
   $_ = join "\t", @F' < file
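Both Perl variants avoid a pitfall of the plain `[0-9]\{6\}` pattern from the question: without the look-arounds (or a length check), a longer run of digits contributes its first six digits. A minimal sketch of the difference, assuming a grep that supports `-o` and `-E` (GNU or BSD); the sample string and the helper name `six_digit_runs` are made up for illustration:

```shell
# Hypothetical sample: one 7-digit run and one genuine 6-digit number.
sample='id 1234567, 192315'

# Plain six-digit pattern: also picks up the first six digits of 1234567.
naive=$(printf '%s\n' "$sample" | grep -oE '[0-9]{6}' | paste -sd' ' -)

# Extract whole digit runs first, then keep only those of length 6
# (the same idea as the "length == 6" filter in the first Perl variant).
six_digit_runs() {
  grep -oE '[0-9]+' | awk 'length($0) == 6'
}
strict=$(printf '%s\n' "$sample" | six_digit_runs | paste -sd' ' -)

printf '%s\n' "naive:  $naive"
printf '%s\n' "strict: $strict"
```

Here `naive` ends up containing both `123456` and `192315`, while `strict` contains only `192315`.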

With awk:

awk -F'\t' -v OFS='\t' '
  {
    repl = sep = ""
    while (match($5, /[0-9]+/)) {
      if (RLENGTH == 6) {
        repl = repl sep substr($5, RSTART, RLENGTH)
        sep = " "
      }
      $5 = substr($5, RSTART+RLENGTH)
    }
    $5 = repl
    print
  }' < file
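As a variation on the awk approach (not the loop above, but a sketch of the same idea), field 5 can be split on runs of non-digits so that each piece is a whole digit run, keeping only the pieces of length 6. The function name `keep_six_digit_field5` is hypothetical, and the sample row is modelled on the question with explicit tab separators:

```shell
# Hypothetical helper: rewrite field 5 as its space-separated 6-digit numbers.
keep_six_digit_field5() {
  awk -F'\t' -v OFS='\t' '
    {
      # Split field 5 on runs of non-digits; each part is a whole digit run
      # (or empty, when the field starts or ends with non-digits).
      n = split($5, parts, /[^0-9]+/)
      repl = sep = ""
      for (i = 1; i <= n; i++) {
        if (length(parts[i]) == 6) {
          repl = repl sep parts[i]
          sep = " "
        }
      }
      $5 = repl   # assigning a field rebuilds the line with OFS
      print
    }'
}

# Sample row modelled on the question; field 9 (123456) must stay untouched.
printf 'gene1\tNM_033629\t598G>A\tP912\tsyndrome 1, 192315 syndrome 2, 225750 syndrome 3 610448\tscore\tAD\thom user\t123456\tSource\n' |
  keep_six_digit_field5
```

Because only `$5` is reassigned, the six-digit number in field 9 passes through unchanged.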

grep itself is not well suited to the task. grep is meant to print the lines that match a pattern. Some implementations, such as GNU grep, ast-open grep, or pcregrep, can extract strings from the matching lines, but that capability is quite limited.

The only cut+grep+paste approach I can think of that could work with some restrictions would be with the pcregrep grep implementation:

n='(?:.*?((?1)))?'
paste <(< file cut -f1-4) <(< file cut -f5 |
  pcregrep --om-separator=" " -o1 -o2 -o3 -o4 -o5 -o6 -o7 -o8 -o9 \
    "((?<!\d)\d{6}(?!\d))$n$n$n$n$n$n$n$n"
  ) <(< file cut -f6-)

That assumes that every line of input has at least 6 fields and that the 5th field of each line contains between one and nine 6-digit numbers.
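Before relying on that pipeline, it may be worth checking whether the input actually meets those two assumptions. A rough sketch, assuming a POSIX awk; the function name `check_pcregrep_assumptions` and the three sample lines are hypothetical:

```shell
# Hypothetical helper: report lines that break either assumption.
check_pcregrep_assumptions() {
  awk -F'\t' '
    {
      # Count whole digit runs of length 6 in field 5.
      n = split($5, parts, /[^0-9]+/)
      count = 0
      for (i = 1; i <= n; i++)
        if (length(parts[i]) == 6) count++

      if (NF < 6)
        printf "line %d: only %d field(s)\n", NR, NF
      else if (count < 1 || count > 9)
        printf "line %d: %d six-digit number(s) in field 5\n", NR, count
    }'
}

# Made-up input: line 1 is fine, line 2 has no 6-digit number in field 5,
# line 3 has too few fields.
printf 'g1\ta\tb\tc\tx 192315\ts\ng2\ta\tb\tc\tno numbers here\ts\ng3\ta\tb\n' |
  check_pcregrep_assumptions
```

On real data this would be run as `check_pcregrep_assumptions < file`; no output means both assumptions hold.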
