Extract number of length n from field and return string

bioinformatics, grep, text processing

I have a tab-delimited file with a combination of text and numbers. I want to keep each line as is, except that in the 5th column I want to keep only the six-digit numbers, if present. For example:

gene1   NM_033629   598G>A  P912    syndrome  1, 192315 syndrome 2,  225750 syndrome 3 610448   score   AD  hom user    123456  Source
gene2   NM_000459   613G>A  V115I   syndrome 1 600195   score   AD  rec user    234567  Source

(Syndrome # is used as an example, this can be any text so not a pattern I can search and remove)

I want the output to be:

gene1   NM_033629   598G>A  P912    192315 225750 610448    score   AD  hom user    123456  Source
gene2   NM_000459   613G>A  V115I   600195  score   AD  rec user    234567  Source

I have four ways to extract the six-digit numbers, but I cannot:

a. output each number on the line it originated from

b. print the entire line with only that one field edited.

The options I have used to extract the digits are:

cat inputfile | cut -f 5 |grep -P '(?<!\d)\d{6}(?!\d)'
cat inputfile | cut -f 5 |grep -Po '(?<!\d)\d{6}(?!\d)'
cat inputfile | cut -f 5 |grep -o '[[:digit:]]*'
cat inputfile | cut -f 5 |grep -o "[0-9]\{6\}"

I know using cut for the column is incorrect, but I wanted to make sure the extraction itself was correct, as there is also a six-digit number in field 9. I'm stuck on putting this all together. Thanks in advance for any suggestions.

Best Answer

If I understand correctly, you want the 5th column to become the space-separated concatenation of all the 6-digit numbers in it.

Maybe:

perl -F'\t' -lape '
   $F[4] = join " ", grep {length == 6} ($F[4] =~ /\d+/g);
   $_ = join "\t", @F' < file

Or reusing your negative lookaround operators:

perl -F'\t' -lape '
   $F[4] = join " ", ($F[4] =~ /(?<!\d)\d{6}(?!\d)/g);
   $_ = join "\t", @F' < file

With awk:

awk -F'\t' -v OFS='\t' '
  {
    repl = sep = ""
    while (match($5, /[0-9]+/)) {
      if (RLENGTH == 6) {
        repl = repl sep substr($5, RSTART, RLENGTH)
        sep = " "
      }
      $5 = substr($5, RSTART+RLENGTH)
    }
    $5 = repl
    print
  }' < file
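As a variation on the awk loop above (not part of the original answer), the digit runs can also be pulled out with split() on non-digit characters. This is a minimal sketch: the function name keep6 is hypothetical, and the sample row assumes the question's "hom user" is a single tab-separated field.

```shell
# keep6: hypothetical helper; reads tab-separated lines on stdin and
# reduces column 5 to its 6-digit numbers, joined by single spaces.
keep6() {
  awk -F'\t' -v OFS='\t' '
  {
    # Split column 5 on runs of non-digits; parts[] holds the digit runs
    # in their original order (some entries may be empty strings).
    n = split($5, parts, /[^0-9]+/)
    repl = sep = ""
    for (i = 1; i <= n; i++)
      if (length(parts[i]) == 6) {   # keep runs of exactly 6 digits
        repl = repl sep parts[i]
        sep = " "
      }
    $5 = repl
    print
  }'
}

# First sample row from the question, rebuilt with real tabs.
printf 'gene1\tNM_033629\t598G>A\tP912\tsyndrome  1, 192315 syndrome 2,  225750 syndrome 3 610448\tscore\tAD\thom user\t123456\tSource\n' | keep6
```

Because only column 5 is split, the six-digit number in field 9 is left untouched, and a run of seven or more digits is skipped rather than truncated.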

grep itself is not well suited to the task. grep is meant to print the lines that match a pattern. Some implementations, such as GNU grep, ast-open grep, or pcregrep, can extract the matching strings from those lines, but that capability is quite limited.
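A small sketch of that limitation, using two made-up values standing in for column 5 of two rows (the variable name field5 is only for illustration). With grep -o, each match is printed on its own line, so the link between a number and its row is lost, and the unanchored [0-9]\{6\} also cuts six digits out of a longer run:

```shell
# Two made-up column-5 values, one per input row.
field5='syndrome 1, 192315 syndrome 2, 225750
other 1234567 text 600195'

# Two input rows produce four output lines, and the 7-digit run
# 1234567 contributes a spurious 123456.
printf '%s\n' "$field5" | grep -o '[0-9]\{6\}'
```

This is why the question's fourth attempt cannot be pasted back next to the other columns, and why the lookaround version (or a length check) is needed to skip longer digit runs.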

The only cut+grep+paste approach I can think of that could work with some restrictions would be with the pcregrep grep implementation:

n='(?:.*?((?1)))?'
paste <(< file cut -f1-4) <(< file cut -f5 |
  pcregrep --om-separator=" " -o1 -o2 -o3 -o4 -o5 -o6 -o7 -o8 -o9 \
    "((?<!\d)\d{6}(?!\d))$n$n$n$n$n$n$n$n"
  ) <(< file cut -f6-)

That assumes that every line of input has at least 6 fields and that the 5th field of each contains between one and nine 6-digit numbers.

Only cat from specific line X (with a pattern) to other specific line Y (with a pattern)

OUTPUT

line3
line4
line5

And with any other:

sed -n '/foo/G;/\n/,/goo/!d;//q;/\n/!p 
' <<\DATA
line1
foo 
line3
line4
line5
goo 
line7
DATA

OUTPUT

line3
line4
line5

Either way, though, this also stops reading its input as soon as it encounters the last line in your search.

How to grep and cut numbers from a file and sum them

You can use paste to join the numbers with + so that bc can do the addition:

% grep "30201" logfile.txt | cut -f6 -d "|"
650
1389
945

% grep "30201" logfile.txt | cut -f6 -d "|" | paste -sd+
650+1389+945

% grep "30201" logfile.txt | cut -f6 -d "|" | paste -sd+ | bc
2984

If you have grep with PCRE support, grep alone can do the extraction, using \K to drop everything matched before the number (the effect of a positive lookbehind):

% grep -Po '\|30201\|.*\|\K\d+' logfile.txt | paste -sd+ | bc
2984

With awk alone:

% awk -F'|' '$3 == 30201 {sum+=$NF}; END{print sum}' logfile.txt        
2984

-F'|' sets the field separator as |
$3 == 30201 {sum+=$NF} adds up the last field's values if the third field is 30201
END{print sum} prints the sum at the END
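To try the awk version without the real log, a few made-up records can stand in for logfile.txt. The layout below (six |-separated fields, the code in field 3, the amount in the last field) is an assumption inferred from the commands above, and the other field values are invented:

```shell
# Made-up records in the assumed layout: code in field 3, amount last.
log='2015-01-01|a|30201|x|y|650
2015-01-01|b|30202|x|y|100
2015-01-02|c|30201|x|y|1389
2015-01-03|d|30201|x|y|945'

# Sum the last field of every record whose third field is 30201.
printf '%s\n' "$log" | awk -F'|' '$3 == 30201 {sum += $NF} END {print sum}'
```

The record with code 30202 is ignored, so the result is the sum of the three amounts shown in the cut output above.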
