Print only lines where the first column is unique

awk, text processing

I am looking for a way to sort a list and print all lines whose first column appears only once, i.e., matching only on the first column.
For example, I have a file where the first column is a path and the second column contains a 'type':

/path/foo/1 footsy
/path/foo/1 barsy
/path/foo/X barsy
/path/bar/2 footsy
/path/bar/2 barsy
/path/foo/Y footsy

(The file is actually sorted with sort -k1,1.)

Now I would like to extract only the lines whose path occurs once, like:

/path/foo/X barsy
/path/foo/Y footsy

I am thinking about some way with awk, where I would have to store the previous line and compare the first field of the previous line to the corresponding field in the current line. But I do not yet have an idea how to get it done 🙁
I tried to adapt a solution found in another question, but it is not really working as hoped:

awk '{
  prev=$0; path=$1; type=$2
  getline
  if ($1 != $path) {
    print prev
  }
}'

Best Answer

  1. awk normally reads each line of the input and invokes the script on it.  The cases where you would use getline are few and far between.  When your script is run with six lines of input, this is an overview of what happens:

    Read line 1 normally

    Set variables
    Call getline, which reads line 2
    Compare variables

    Read line 3 normally

    Set variables
    Call getline, which reads line 4
    Compare variables

    Read line 5 normally

    Set variables
    Call getline, which reads line 6
    Compare variables

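    This can’t work: only the pairs (1, 2), (3, 4) and (5, 6) are ever compared, so line 2 is never checked against line 3, nor line 4 against line 5.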

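  2. You also made a common mistake in your awk code.  In awk, fields from the input are referenced as $number and variables are referenced as variable_name.  This is different from shell scripts, where command line arguments are referenced as $number and variables are referenced as $variable_name.  In awk, $path means the field whose number is the value of path; since path holds a non-numeric string here, that number is 0, and $path is $0, the whole line.  Your test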

    if ($1 != $path)
    

    should be

    if ($1 != path)
    
  3. Your overall approach is flawed.  You can’t identify strings that occur only once in the file by looking at two lines at a time.  I believe that you can do it by looking at three lines at a time (i.e., by keeping the two previous lines in variables), but things like that get complicated and messy.  It’s probably simpler to count occurrences.  Since your file is sorted on the first column, equal paths are adjacent, so a running count per group is enough.  Here’s a minimal modification of your script to do that.

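    awk '{
      if ($1 != path) {        # first field changed: a new group starts
        if (count == 1) {      # the previous group had exactly one line
          print prev
        }
        count=1
      }
      else count++             # same first field as the previous line
      prev=$0; path=$1         # remember this line for the next comparison
    }
    END {
      if (count == 1) {        # the last group has no following line, so check it here
        print prev
      }
    }'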
    

    I deleted type, since you never used it.

    Disclosure: This is essentially the same as the last part of glenn’s answer.
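
If the input were not already sorted, one alternative is to read the file twice: count each first field on the first pass, then print the lines whose count is 1 on the second. This is not the script above, just a sketch of a common awk idiom (NR == FNR is true only while the first file argument is being read). The temporary file `input` and the function name `unique_first_col` are made up for illustration.

```shell
# Sample data from the question, written to a temporary file for illustration.
input=$(mktemp)
cat > "$input" <<'EOF'
/path/foo/1 footsy
/path/foo/1 barsy
/path/foo/X barsy
/path/bar/2 footsy
/path/bar/2 barsy
/path/foo/Y footsy
EOF

# Hypothetical helper: the same file is passed to awk twice.
# Pass 1 (NR == FNR): count how often each first field occurs.
# Pass 2: print every line whose first field was counted exactly once.
unique_first_col() {
  awk 'NR == FNR { count[$1]++; next } count[$1] == 1' "$1" "$1"
}

unique_first_col "$input"
```

This keeps the lines in their original order and does not depend on sorting, at the cost of reading the file twice and holding one counter per distinct first field in memory. It needs a real file, so it won't work on input piped from another command.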
