I have a tab-delimited file with a combination of text and numbers. I want to keep each line as is, except that in the 5th column I want to keep only the six-digit numbers, if present. For example:
gene1 NM_033629 598G>A P912 syndrome 1, 192315 syndrome 2, 225750 syndrome 3 610448 score AD hom user 123456 Source
gene2 NM_000459 613G>A V115I syndrome 1 600195 score AD rec user 234567 Source
("Syndrome #" is only an example; this can be any text, so it is not a pattern I can search for and remove.)
I want the output to be:
gene1 NM_033629 598G>A P912 192315 225750 610448 score AD hom user 123456 Source
gene2 NM_000459 613G>A V115I 600195 score AD rec user 234567 Source
I have 4 ways to extract the six-digit numbers; however, I cannot:
a. output the number in the line it originated from
b. successfully print the entire line with the one edited field.
The options I have used to extract the digits are:
cat inputfile | cut -f 5 |grep -P '(?<!\d)\d{6}(?!\d)'
cat inputfile | cut -f 5 |grep -Po '(?<!\d)\d{6}(?!\d)'
cat inputfile | cut -f 5 |grep -o '[[:digit:]]*'
cat inputfile | cut -f 5 |grep -o "[0-9]\{6\}"
I know using cut for the column is incorrect, but I wanted to make sure I had the extraction right, as there is also a six-digit number in field 9. I'm stuck on putting this all together. Thanks in advance for any suggestions.
Best Answer
If I understand correctly, you want the 5th column to become the space-separated concatenation of all the 6-digit numbers in it.
Maybe:
Or, reusing your negative lookaround operators:
With awk:
grep itself is not very adequate for the task. grep is meant to print the lines that match a pattern. Some implementations, like GNU or ast-open grep, or pcregrep, can extract strings from the matching lines, but that's quite limited.
The only cut + grep + paste approach I can think of that could work, with some restrictions, would be with the pcregrep grep implementation:
That assumes that every line of input has at least 6 fields and that the 5th field of each has between 1 and 9 six-digit numbers.
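For the awk route mentioned earlier, a minimal sketch using only POSIX awk features (match(), RSTART, RLENGTH) could look like the following. It walks through the 5th field one digit run at a time and keeps the runs of length 6. The sample file and the function name keep_six_digit_awk are made up for illustration, and the sketch assumes every line has at least 5 fields.

```shell
# Hypothetical sample input modelled on the question; fields are tab-separated.
sample=$(mktemp)
printf 'gene1\tNM_033629\t598G>A\tP912\tsyndrome 1, 192315 syndrome 2, 225750 syndrome 3 610448\tscore\tAD\thom\tuser 123456\tSource\n' > "$sample"
printf 'gene2\tNM_000459\t613G>A\tV115I\tsyndrome 1 600195\tscore\tAD\trec\tuser 234567\tSource\n' >> "$sample"

# keep_six_digit_awk is an illustrative name, not part of any tool.
keep_six_digit_awk() {
  awk -F'\t' -v OFS='\t' '
  {
    rest = $5; kept = ""; sep = ""
    # Consume the field one digit run at a time.
    while (match(rest, /[0-9]+/)) {
      if (RLENGTH == 6) {
        kept = kept sep substr(rest, RSTART, RLENGTH)
        sep = " "
      }
      rest = substr(rest, RSTART + RLENGTH)
    }
    # Assigning to $5 rebuilds the record with OFS (a tab) between fields.
    $5 = kept
    print
  }'
}

keep_six_digit_awk < "$sample"
```

Because match() returns the longest run of digits at the leftmost position, a 7-digit number is seen as one run of length 7 and is dropped rather than being split into a 6-digit piece.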