Remove lines based on duplicates within one column without sort

Tags: awk, text processing

I have large 3-column files (~10,000 lines) and I would like to remove a line whenever the contents of its third column appear in the third column of another line. The files' size makes sort a bit cumbersome, and I can't use something like the code below because the entire lines aren't identical; only the contents of column 3 are.

awk '!seen[$0]++' filename
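
For example, with a hypothetical space-separated input like

foo 1 x
bar 2 y
baz 3 x

the third line should be dropped because x already appears in column 3 of the first line, even though the rest of the line differs.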

Best Answer

Just change your awk command to use the column on which you want to deduplicate (in your case, the third column):

awk '!seen[$3]++' filename

This command tells awk which lines to print. The variable $3 holds the contents of the third column, and the square brackets are an array access. For each line in filename, the element of the array named seen indexed by the third-column value is incremented, and the line is printed only if that element was not (!) already set, i.e. if that column-3 value has not been seen before.
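
As a minimal sketch, assuming a hypothetical space-separated file with the same three lines as in the question, only the first line for each distinct column-3 value is kept:

$ cat filename        # hypothetical sample data
foo 1 x
bar 2 y
baz 3 x
$ awk '!seen[$3]++' filename
foo 1 x
bar 2 y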

The above works if the columns in your input file are delimited by spaces or tabs. If the delimiter is something else, you need to tell awk about it with its -F option. For example, if the columns are comma-delimited and you want to remove lines based on the third column, use the following command:

awk -F',' '!seen[$3]++' filename
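
For example, with a hypothetical comma-separated file, the same logic applies and only the field separator changes:

$ cat filename        # hypothetical sample data
foo,1,x
bar,2,y
baz,3,x
$ awk -F',' '!seen[$3]++' filename
foo,1,x
bar,2,y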