Shell – How to count duplicated last columns without removing them

command line, shell, text processing

I have a file that contains 4 columns. I want to compare the last three columns and count how many times each combination occurs, without deleting any of the lines. I just want the count to appear in front of each line.

My file looks like this:

ID-jacob  4.0  6.0  42.0  
ID-elsa   5.0  8.0  45.0  
ID-fred   4.0  6.0  42.0  
ID-gerard 6.0  8.0  20.0  
ID-trudy  5.0  8.0  45.0  
ID-tessa  4.0  6.0  42.0

My desired outcome is:

3 ID-jacob  4.0  6.0  42.0  
2 ID-elsa   5.0  8.0  45.0  
3 ID-fred   4.0  6.0  42.0  
1 ID-gerard 6.0  8.0  20.0  
2 ID-trudy  5.0  8.0  45.0  
3 ID-tessa  4.0  6.0  42.0

I tried to use sort and uniq, but this keeps only the first line of each group of duplicates:

cat file | sort -k2,4 | uniq -c -f1 > outputfile

Best Answer

Storing a large file entirely in memory can cause trouble. This approach is slightly better: once sort has done the heavy lifting of putting the lines in order, awk only has to hold the current run of matching lines in memory.

# Input must be sorted first, then we only need to keep matching lines in memory
# Once we reach a non-matching line we print the lines in memory, prefixed by count
# In awk, variables start out unset, so we can get away without initializing them explicitly
{ # S2, S3, S4 are saved field values
  if($2 == S2 && $3 == S3 && $4 == S4) {
    # if fields 2,3,4 are same as last, save line in array, increment count
    line[count++] = $0;
  } else {
    # new line with fields 2, 3, 4 different
    # print stored lines, prefixed by the count
    # loop by index: for(i in line) visits elements in unspecified order
    for(i = 0; i < count; i++) {
      print count, line[i];
    }
    # reset counter and array
    count=0;
    delete line;
    # save this line in array, increment count
    line[count++] = $0;
  }

  # store field values to compare with next line read
  S2 = $2; S3 = $3; S4 = $4;
}
END{ # on EOF we still have saved lines in array, print last lines
    # loop by index: for(i in line) visits elements in unspecified order
    for(i = 0; i < count; i++) {
      print count, line[i];
    }
}

It is customary to save awk scripts in a file.
Assuming the script above is saved in a file named script, you could use it along the lines of
sort -k2,4 file | awk -f script

3 ID-fred   4.0  6.0  42.0  
3 ID-jacob  4.0  6.0  42.0  
3 ID-tessa  4.0  6.0  42.0
2 ID-elsa   5.0  8.0  45.0  
2 ID-trudy  5.0  8.0  45.0  
1 ID-gerard 6.0  8.0  20.0
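
If the original line order matters, as in the desired outcome in the question, a two-pass variant may be worth considering. This is only a minimal sketch and not the approach shown above: it assumes the input is a regular file that can be read twice (not a pipe), and that one counter per distinct combination of fields 2 to 4 fits in memory. The temporary file and the variable names input and result are made up for the illustration.

```shell
# Recreate the sample data from the question in a temporary file (hypothetical setup).
input=$(mktemp)
cat > "$input" <<'EOF'
ID-jacob  4.0  6.0  42.0
ID-elsa   5.0  8.0  45.0
ID-fred   4.0  6.0  42.0
ID-gerard 6.0  8.0  20.0
ID-trudy  5.0  8.0  45.0
ID-tessa  4.0  6.0  42.0
EOF

# The same file is passed twice.
# Pass 1 (NR == FNR): count each combination of fields 2, 3 and 4.
# Pass 2: print every line, in input order, prefixed by its combination's count.
result=$(awk 'NR == FNR { n[$2, $3, $4]++; next }
              { print n[$2, $3, $4], $0 }' "$input" "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Only the counts are kept in memory, never the lines themselves, at the cost of reading the file twice instead of sorting it.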

Related Solutions

Shell – removing redundancy from output columns

doit () 
{ 
    awk '{
           key=$1<=$2? $1 FS $2 : $2 FS $1; 
           if (!seen[key]) print $1,$2
           seen[key]=1
    }'
}

$ doit <test
A B
A C
A D
B C
$

(or, getting terser with it 'cause Chris Down's answer's so sweet)

awk '!seen[$1<=$2? $1 FS $2: $2 FS $1]++ {print $1,$2}'

which can be reduced further if you don't need the output spacing normalized (the default action prints each line unchanged)

awk '!seen[$1<=$2? $1 FS $2: $2 FS $1]++'

The FS is awk's "field separator" variable, used here to guarantee the boundaries between key fields will be properly identified. My original had them run together, $1$2, which as Stephane Chazelas pointed out would have treated A BC and AB C as duplicates.
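
The collision described above can be reproduced with a two-line input. The following is a small sketch with made-up sample data (the variable names pairs, joined and separated are illustrative only); it compares a key built by plain concatenation with one joined by FS.

```shell
# Two distinct pairs whose fields concatenate to the same string "ABC".
pairs=$(printf 'A BC\nAB C\n')

# Key built as $1 $2 (plain concatenation): both lines map to "ABC",
# so the second line is treated as a duplicate and dropped.
joined=$(printf '%s\n' "$pairs" | awk '!seen[$1 $2]++')

# Key built as $1 FS $2: the keys are "A BC" and "AB C", which differ,
# so both lines are kept.
separated=$(printf '%s\n' "$pairs" | awk '!seen[$1 FS $2]++')

printf 'without FS:\n%s\n' "$joined"
printf 'with FS:\n%s\n' "$separated"
```

With the concatenated key only the first line survives, while the FS-joined key keeps both.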
