I have a 4 column tab delimited file and the last column sometimes has duplicates. This is an excerpt from that file:
chr7 116038644 116039744 GeneA
chr7 116030947 116032047 GeneA
chr7 115846040 115847140 GeneA
chr7 115824610 115825710 GeneA
chr7 115801509 115802609 GeneA
chr7 115994986 115996086 GeneA
chrX 143933024 143934124 GeneB
chrX 143933119 143934219 GeneB
chrY 143933129 143933229 GeneC
For every set of duplicate in that column, I want to convert them into something like this (without really touching the non duplicate values in that column):
chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC
How can I do this with awk
or sed
or Bash's for
loop?
Best Answer
Try this
This will store the occurence of each value of the 4th field in a counter array
count
(where the value of the 4th field is used as "index"), and append the pre-incremented value of that counter to the 4th field, separated by a dash.The above "simple" example has a disadvantage: it will add a disambiguation number even to those values in column 4 that only appear once in the file. In order to suppress that, the following double-pass approach will work (command broken over two lines via
\
to improve readability):Note that the file to be processed is stated twice as argument, and will therefore be read twice.
FNR
, the per-file line counter, being equal toNR
, the global line counter), we simply count how often every distinct value of column 4 appears in the file, and store that in an arrayf
.This approach avoids buffering the entire file, which can be an advantage if the file is very large. The processing time is of course longer because the file is read two times.
As a general rule, using shell loops for text processing is rarely necessary, as
awk
e.g. can perform loop operations by itself in a much more efficient way.