How to Add Unique Strings to Duplicates in a Tab Delimited File

awk, text processing

I have a 4 column tab delimited file and the last column sometimes has duplicates. This is an excerpt from that file:

chr7    116038644       116039744       GeneA
chr7    116030947       116032047       GeneA
chr7    115846040       115847140       GeneA
chr7    115824610       115825710       GeneA
chr7    115801509       115802609       GeneA
chr7    115994986       115996086       GeneA
chrX    143933024       143934124       GeneB
chrX    143933119       143934219       GeneB
chrY    143933129       143933229       GeneC

For every set of duplicates in that column, I want to convert them into something like this (without touching the non-duplicate values in that column):

chr7    116038644       116039744       GeneA-1
chr7    116030947       116032047       GeneA-2
chr7    115846040       115847140       GeneA-3
chr7    115824610       115825710       GeneA-4
chr7    115801509       115802609       GeneA-5
chr7    115994986       115996086       GeneA-6
chrX    143933024       143934124       GeneB-1
chrX    143933119       143934219       GeneB-2
chrY    143933129       143933229       GeneC

How can I do this with awk, sed, or a Bash for loop?

Best Answer

Try this:

awk -F'\t' -v OFS='\t' '{$4=$4 "-" (++count[$4])}1' file.tsv

This stores the number of occurrences of each value of the 4th field in a counter array count (using the value of the 4th field as the index), and appends the incremented counter value to the 4th field, separated by a dash.
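On the sample data above, this produces:

chr7    116038644       116039744       GeneA-1
chr7    116030947       116032047       GeneA-2
chr7    115846040       115847140       GeneA-3
chr7    115824610       115825710       GeneA-4
chr7    115801509       115802609       GeneA-5
chr7    115994986       115996086       GeneA-6
chrX    143933024       143934124       GeneB-1
chrX    143933119       143934219       GeneB-2
chrY    143933129       143933229       GeneC-1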

The above "simple" example has a disadvantage: it will add a disambiguation number even to those values in column 4 that only appear once in the file. In order to suppress that, the following double-pass approach will work (command broken over two lines via \ to improve readability):

awk -F'\t' -v OFS='\t' 'NR==FNR{f[$4]++}\
     NR>FNR{if (f[$4]>1) {$4=$4 "-" (++count[$4])}; print}' file.tsv file.tsv

Note that the file to be processed is given twice as an argument, and is therefore read twice.

  • The first time it is read (indicated by FNR, the per-file line counter, being equal to NR, the global line counter), we simply count how often each distinct value of column 4 appears in the file, and store that in an array f.
  • The second time the file is read, we perform the actual text processing just as in the "simple" approach and append the occurrence counter to column 4, but only if the total number of occurrences found in the first pass is greater than 1. The result on the sample file is shown below.
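Running the two-pass command on the sample file reproduces the desired output; in particular, the unique GeneC line keeps its original value. The last three lines of the output are:

chrX    143933024       143934124       GeneB-1
chrX    143933119       143934219       GeneB-2
chrY    143933129       143933229       GeneC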

This approach avoids buffering the entire file in memory, which can be an advantage if the file is very large. The processing of course takes longer, because the file is read twice.
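If you would rather not read the file twice (for example because the data arrives on a pipe and cannot be re-read), a single-pass variant that buffers all lines in memory is also possible. Here is a minimal sketch, assuming the input fits comfortably into memory:

awk -F'\t' -v OFS='\t' '
    {
        line[NR] = $0        # buffer every record
        total[$4]++          # count occurrences of each column-4 value
    }
    END {
        for (i = 1; i <= NR; i++) {
            $0 = line[i]                                   # re-split the buffered record
            if (total[$4] > 1) $4 = $4 "-" (++count[$4])   # disambiguate only true duplicates
            print
        }
    }' file.tsv

This trades memory for I/O: the file is read only once, but every line is held in the line array until the END block runs.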

As a general rule, shell loops are rarely necessary for text processing; a tool like awk can perform the looping itself far more efficiently.