Remove lines based on duplicates within one column without sort

Tags: awk, text processing

I have large 3-column files (~10,000 lines) and I would like to remove a line whenever the contents of its third column appear in the third column of another line. The files' size makes sort a bit cumbersome, and I can't use something like the code below because the entire lines aren't identical; only the contents of column 3 are.

awk '!seen[$0]++' filename
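
For example, with a hypothetical space-separated input like

foo 1 x
bar 2 y
baz 3 x

the third line should be dropped because x already appears in column 3 of the first line, even though the rest of the line differs.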

Best Answer

Just change your awk command to use the column on which you want to deduplicate (in your case, the third column):

awk '!seen[$3]++' filename

This command tells awk which lines to print. The variable $3 holds the contents of the third column, and the square brackets are an array access. For each line in filename, the element of the array named seen indexed by the third-column value is incremented, and the line is printed only if that element was not (!) already set, i.e. if that column-3 value has not been seen before.
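
As a minimal sketch, assuming a hypothetical space-separated file with the same three lines as in the question, only the first line for each distinct column-3 value is kept:

$ cat filename        # hypothetical sample data
foo 1 x
bar 2 y
baz 3 x
$ awk '!seen[$3]++' filename
foo 1 x
bar 2 y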

The above works if the columns in your input file are delimited by spaces or tabs. If the delimiter is something else, you need to tell awk about it with its -F option. For example, if the columns are comma-delimited and you want to remove lines based on the third column, use the following command:

awk -F',' '!seen[$3]++' filename
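
For example, with a hypothetical comma-separated file, the same logic applies and only the field separator changes:

$ cat filename        # hypothetical sample data
foo,1,x
bar,2,y
baz,3,x
$ awk -F',' '!seen[$3]++' filename
foo,1,x
bar,2,y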