Print only lines where the first column is unique

awk, text-processing

I am looking for a way to print all lines of a sorted list whose first-column value appears only once, i.e., uniqueness is judged on the first column alone.
For example, I have a file where the first column is a path and the second column contains a 'type':

/path/foo/1 footsy
/path/foo/1 barsy
/path/foo/X barsy
/path/bar/2 footsy
/path/bar/2 barsy
/path/foo/Y footsy

(The file is actually sorted with sort -k1,1.)

Now, I would like to extract only cases like

/path/foo/X barsy
/path/foo/Y footsy

I am thinking of doing this with awk: store the previous line and compare its first field with the first field of the current line. But I have no idea yet how to get it done 🙁
I tried to adapt a solution found in another question, but it does not work as hoped:

awk '{
  prev=$0; path=$1; type=$2
  getline
  if ($1 != $path) {
    print prev
  }
}'

Best Answer

awk normally reads each line of the input and invokes the script on it. The cases where you would use getline are few and far between. When your script is run with six lines of input, this is an overview of what happens:

Read line 1 normally

Set variables
Call getline, which reads line 2
Compare variables

Read line 3 normally

Set variables
Call getline, which reads line 4
Compare variables

Read line 5 normally

Set variables
Call getline, which reads line 6
Compare variables

Obviously this isn’t going to work.
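The pairing effect is easy to see on six numbered lines. The following is only an illustrative sketch; the function name pair_lines is made up for the demonstration and is not part of the original script.

```shell
# Hypothetical demo: every run of the awk script consumes two input lines,
# one read by awk's main loop and one read by getline.
pair_lines() {
  awk '{
    first = $0      # line delivered by the main loop
    getline         # overwrites $0 with the next input line
    print first, $0
  }'
}

printf '%s\n' 1 2 3 4 5 6 | pair_lines
```

This prints 1 2, 3 4 and 5 6: line 2 is never compared with line 3, nor line 4 with line 5.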
Secondly, you made a common mistake in your awk code. In awk, fields from the input are referenced as $number and variables are referenced as variable_name. This is different from shell scripts, where command line arguments are referenced as $number and variables are referenced as $variable_name. Your test
```
if ($1 != $path)
```
should be
```
if ($1 != path)
```
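To see what $ applied to a variable does in awk, here is a small sketch (the function name show_dollar_var is invented for illustration). With a numeric variable, $n selects field n; with a non-numeric string such as a path, the value converts to 0, so $path ends up meaning $0, the whole line.

```shell
# Hypothetical demo of "$" applied to a variable in awk.
show_dollar_var() {
  awk '{
    n = 2
    print $n        # field number n, i.e. $2
    path = $1
    print $path     # path is non-numeric, converts to 0, so this is $0
  }'
}

printf '%s\n' '/path/foo/1 footsy' | show_dollar_var
```

The first output line is the second field (footsy); the second is the entire input line, which is why the original test compared $1 with the whole line rather than with the saved path.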
Your overall approach is flawed. You can’t identify strings that occur only once in the file by looking at two lines at a time. I believe that you can do it by looking at three lines at a time (i.e., by keeping the two previous lines in variables), but things like that get complicated and messy. It’s probably simpler to count occurrences. Here’s a minimal modification on your script to do that.
```
awk '{
  if ($1 != path) {
    if (count == 1) {
      print prev
    }
    count=1
  }
  else count++
  prev=$0; path=$1
}
END {
    if (count == 1) {
      print prev
    }
}'
```
I deleted type, since you never used it.
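As a quick check, the counting script can be run against the sample data from the question. The wrapper function name print_unique_first_column is an assumption made for this sketch; the awk body is the script shown above, and it still assumes that equal first columns are adjacent.

```shell
# The counting script, wrapped in a hypothetically named shell function.
# Assumes lines with the same first column are adjacent (e.g. sorted input).
print_unique_first_column() {
  awk '{
    if ($1 != path) {
      if (count == 1) {      # previous group had exactly one line
        print prev
      }
      count=1
    }
    else count++
    prev=$0; path=$1
  }
  END {
    if (count == 1) {        # handle the last group
      print prev
    }
  }'
}

printf '%s\n' \
  '/path/foo/1 footsy' \
  '/path/foo/1 barsy' \
  '/path/foo/X barsy' \
  '/path/bar/2 footsy' \
  '/path/bar/2 barsy' \
  '/path/foo/Y footsy' \
  | print_unique_first_column
```

On this input it prints the two lines the question asks for, /path/foo/X barsy and /path/foo/Y footsy.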

Disclosure: This is essentially the same as the last part of glenn’s answer.
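The script above depends on equal first columns being adjacent. If that cannot be guaranteed, one possible variant (not part of the original answer, and with a made-up function name) counts occurrences in an array and prints in the END block. It keeps every line in memory, so it is a sketch for moderately sized input only.

```shell
# Hypothetical variant for input that is NOT grouped by first column.
unique_first_column_unsorted() {
  awk '{
    count[$1]++        # occurrences of each first-column value
    line[NR] = $0      # remember every line in input order
    key[NR]  = $1
  }
  END {
    for (i = 1; i <= NR; i++)
      if (count[key[i]] == 1) print line[i]
  }'
}

printf '%s\n' 'b 1' 'a 2' 'b 3' 'c 4' | unique_first_column_unsorted
```

Here only the lines for a and c are printed, in their original order.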

Related Solutions

Extract Strings from First Column of a File – Text Processing Guide

You can use the following awk program:

awk -F' *[|]' 'NR==FNR{searchstr[$1]=1} NR>FNR && ($1 in searchstr) {print}' string.txt masterFile.list

As you can see, you provide both files as arguments to awk.

While the first file is processed (indicated by FNR, the per-file line counter, being equal to NR, the global line counter), we simply register all search strings (field 1 of each line, since they are the only items) in an array searchstr. Each string is stored as an array index, so the "value" is just a dummy value of 1.
When we come to the second file (NR is now greater than FNR), we check if the first column ($1) is contained as an array index in searchstr. If so, we print the entire line.

The idea behind this is that awk has a convenient syntax string in array which is true if string is in the list of array indices of array.
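The two-file idiom can be tried on throwaway data. The file contents below are invented for illustration (the real string.txt and masterFile.list are not shown here), and filter_master is a hypothetical wrapper name. The separator is written as ' *[|]' so that the pipe is matched literally.

```shell
# Hypothetical sample data in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 101 103 > "$tmpdir/string.txt"
printf '%s\n' '101 | alpha' '102 | beta' '103 | gamma' > "$tmpdir/masterFile.list"

filter_master() {
  # $1: file with search strings, $2: master file
  awk -F' *[|]' '
    NR==FNR { searchstr[$1] = 1 }            # first file: register keys
    NR>FNR && ($1 in searchstr) { print }    # second file: print matches
  ' "$1" "$2"
}

filter_master "$tmpdir/string.txt" "$tmpdir/masterFile.list"
rm -rf "$tmpdir"
```

Only the master lines whose first field is 101 or 103 are printed.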

As noted by Ed Morton, you can "golf" this into

awk -F' *[|]' 'NR==FNR{searchstr[$1]; next} $1 in searchstr' string.txt masterFile.list

The bare reference searchstr[$1] creates (but does not fill) that array entry, and the pattern $1 in searchstr, which has no action block, instructs awk to print the current line whenever it evaluates to true. The next statement in the rule for string.txt ensures that this pattern is only reached for masterFile.list.

Note that I specified a full regular expression ( *[|], i.e. any amount of space followed by a literal |; the bracket expression keeps awk from reading | as regex alternation) as field separator in order to ensure that the "first field" of masterFile.list really is only the number. Specifying -F'|' would have meant that trailing space is included, too, and would have made the matching process more involved. If the "spaces" can actually also contain TABs, use -F'[[:space:]]*[|]' instead.
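The difference between the two separators can be checked directly. This is a small sketch on an invented input line; first_field is a hypothetical helper that prints the first field in brackets so that trailing spaces become visible.

```shell
# Hypothetical helper: show the first field for a given field separator.
first_field() {
  # $1: value to pass to awk -F
  awk -F"$1" '{ print "[" $1 "]" }'
}

printf '%s\n' '101   | alpha' | first_field '|'       # trailing spaces stay in the field
printf '%s\n' '101   | alpha' | first_field ' *[|]'   # spaces are part of the separator
```

With -F'|' the field keeps its trailing spaces, so it would not match a key stored as plain 101; with the regular expression it is exactly 101.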

Linux – Compare Two Tab-Delimited Files by First Column

Read fileB.txt first, making the 1st field a key and the 2nd field its value in an array, and skipping the header line with FNR>1 (see "What are NR and FNR and what does 'NR==FNR' imply?").

Then read fileA.txt, print its header for the first line and then print its 1st field followed by the corresponding element in the array, if any.

awk '
    FNR==NR && FNR>1{a[$1]=$2}
    NR!=FNR{
        if(FNR>1){print $1,a[$1]}
        else{print "id", "freq.var"}
    }
' OFS="\t" fileB.txt fileA.txt

OFS="\t" sets the output field separator to tab. Since your file is tab delimited, I assume the output file should be tab delimited too.

You can pipe that into column -t for alignment.
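A runnable sketch of the lookup on made-up data follows. The file contents and the wrapper name join_by_first_column are assumptions for illustration; only the awk program itself comes from the answer above.

```shell
# Hypothetical tab-delimited sample files with one header line each.
tmpdir=$(mktemp -d)
printf 'id\tvalue\nA\t1\nB\t2\n'    > "$tmpdir/fileA.txt"
printf 'id\tfreq\nA\t0.5\nC\t0.9\n' > "$tmpdir/fileB.txt"

join_by_first_column() {
  # $1: lookup file (read first), $2: file whose ids are printed
  awk '
    FNR==NR && FNR>1 { a[$1] = $2 }          # lookup file: id -> value
    NR!=FNR {
      if (FNR>1) { print $1, a[$1] }         # data line: id and looked-up value
      else { print "id", "freq.var" }        # replace the header line
    }
  ' OFS="\t" "$1" "$2"
}

join_by_first_column "$tmpdir/fileB.txt" "$tmpdir/fileA.txt"
rm -rf "$tmpdir"
```

An id that is missing from the lookup file (B in this sample) is still printed, followed by a tab and an empty value.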
