Shell – Using AWK to select rows with specific value in specific column

Tags: awk, csv, linux, shell

I have a big csv file, which looks like this:

1,2,3,4,5,6,-99
1,2,3,4,5,6,-99
1,2,3,4,5,6,-99
1,2,3,4,5,6,25178
1,2,3,4,5,6,27986
1,2,3,4,5,6,-99

I want to select only the lines in which the 7th column is equal to -99, so my output should be:

1,2,3,4,5,6,-99
1,2,3,4,5,6,-99
1,2,3,4,5,6,-99
1,2,3,4,5,6,-99

I tried the following:

awk -F, '$7 == -99' input.txt > output.txt
awk -F, '{ if ($7 == -99) print $1,$2,$3,$4,$5,$6,$7 }' input.txt > output.txt

But both of them returned an empty output.txt. Can anyone tell me what I'm doing wrong?
Thanks.

Best Answer

The file that you run the script on has DOS line-endings. It may be that it was created on a Windows machine.

Use dos2unix to convert it to a Unix text file.

Alternatively, run it through tr:

tr -d '\r' <input.txt >input-unix.txt

Then use input-unix.txt with your otherwise correct awk code.

To modify the awk code instead of the input file:

awk -F, '$7 == "-99\r"' input.txt >output.txt

This takes the carriage-return at the end of the line into account.

Or,

awk -F, '$7 + 0 == -99' input.txt >output.txt

This forces the 7th column to be interpreted as a number: the conversion reads the leading numeric part (-99) and ignores the trailing carriage-return, which effectively "removes" it.

Similarly,

awk -F, 'int($7) == -99' input.txt >output.txt

would also remove the \r.
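As a further option that is not part of the answer above, the carriage return can be stripped inside awk before the comparison, which also keeps it out of the output. The following is a minimal sketch, assuming a throwaway sample file created with mktemp; the sample data and the variable names (sample, cr_lines, matches) are made up for illustration.

```shell
# Sample data with DOS (CRLF) line endings; the temporary path is only for this demo.
sample=$(mktemp)
printf '1,2,3,4,5,6,-99\r\n1,2,3,4,5,6,25178\r\n1,2,3,4,5,6,-99\r\n' > "$sample"

# Count the lines that end in a carriage return (non-zero suggests DOS line endings).
cr_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$sample")

# Strip the trailing CR from each record, then compare the 7th field as usual.
# Modifying $0 makes awk re-split the fields, so $7 no longer carries the CR.
matches=$(awk -F, '{ sub(/\r$/, "") } $7 == -99' "$sample")

echo "lines with CR: $cr_lines"
printf '%s\n' "$matches"
rm -f "$sample"
```

The first awk call is only a diagnostic; the second one can replace the original command as long as a stray CR at the end of a line is never meaningful data.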

Related Solutions

Unexpected output from awk printf

The first argument to printf, whether it's C's printf(), the printf utility or awk's printf, is required¹ and is the format string.

You want:

awk '{printf "%s", $0}'

here. If you don't want an output record separator, you can also do:

awk -v ORS= '{print}' < mycsv.csv

Or even:

awk -v ORS= 1 < mycsv.csv

({print} is the default action and true is the default condition, but you need to specify at least one of the two; 1 is one way to say true.)

Though here, tr would be enough:

tr -d '\n' < mycsv.csv

Or if you still want one trailing newline character so that output is still text:

paste -sd '\0' mycsv.csv

It also seems like your file has Microsoft-style CRLF line delimiters, so you may want to also delete the CR characters:

tr -d '\r\n' < mycsv.csv

Or only the CRLF sequences with awk implementations that support more than single-character RS (which includes gawk and mawk but not macOS awk):

awk -v RS='\r\n' -v ORS= 1 < mycsv.csv

Or:

awk -v RS='\r?\n' -v ORS= 1 < mycsv.csv

That is, with the \r made optional so that either Unix or MS-DOS line delimiters are handled.

Or use things like dos2unix or d2u to convert the file to Unix format first.
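For awk implementations without multi-character RS support, the same joining can be sketched by stripping the CR from each record by hand. This is a hedged alternative and not one of the commands above; the sample file and the variable names (csv, joined) are assumptions made for the demo.

```shell
# Two-line CSV sample with CRLF delimiters; the temporary file is for illustration only.
csv=$(mktemp)
printf 'a,b\r\n1,2\r\n' > "$csv"

# Remove a trailing CR from each record and print it without a newline;
# the END block adds a single final newline so the output is still text.
joined=$(awk '{ sub(/\r$/, ""); printf "%s", $0 } END { print "" }' "$csv")

printf '%s\n' "$joined"
rm -f "$csv"
```

Because it relies only on the default single-character RS, this form should behave the same whether the input uses Unix or MS-DOS line delimiters.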

Notes

¹ The format argument to printf is required in the standard specification of the awk utility. In gawk and mawk, omitting it results in an error. In busybox awk, it's equivalent to printf "", and in awk implementations derived from the original one (like on macOS), it's equivalent to printf $0. That is of little use, since $0 is still treated as a format: you'll still get an error if $0 contains % characters.

Linux – Select Columns Where Value Appears More Than X Times Using awk

BEGIN { OFS = FS = "\t" }

FNR == NR {
        for (i = 2; i <= NF; ++i)
                if ($i == 2) ++c[i]
        next
}

{
        a[nf=1] = $1
        for (i = 2; i <= NF; ++i)
                if (c[i] >= t) a[++nf] = $i

        $0 = ""
        for (i = 1; i <= nf; ++i)
                $i = a[i]

        print
}

This awk program counts the number of occurrences of the value 2 in each column and stores these counts in the array c (one element in this array per column of data). It does this while reading the input file the first time (this is the FNR == NR block).

When reading the input file a second time it uses these counts to transfer the appropriate columns from the input to the array a for each line read. The value of the variable t is used as the threshold value to decide whether the column should be included or not. This is the first for loop in the last block in the code.

It then creates a new data record from this array and prints it.

Testing it (note that the input file is given twice on the command line for awk to be able to do two passes over it):

$ cat file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2

$ awk -v t=1 -f script.awk file file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2

$ awk -v t=2 -f script.awk file file
Individuals     M2      M3
Ind1    0       2
Ind2    2       2
Ind3    2       2

$ awk -v t=3 -f script.awk file file
Individuals     M3
Ind1    2
Ind2    2
Ind3    2

$ awk -v t=4 -f script.awk file file
Individuals
Ind1
Ind2
Ind3
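The script above hard-codes the counted value 2. A possible variation, sketched here under the assumption that the value should be configurable, passes it in as a variable v (a hypothetical name, like data and result below) and builds the output line by string concatenation instead of rebuilding the record. The sample data mirrors the test file shown above.

```shell
# Recreate the tab-separated test data in a temporary file (demo only).
data=$(mktemp)
printf 'Individuals\tM1\tM2\tM3\nInd1\t0\t0\t2\nInd2\t0\t2\t2\nInd3\t2\t2\t2\n' > "$data"

# v is the value to count, t the threshold; the file is given twice for two passes.
result=$(awk -v v=2 -v t=2 '
BEGIN { OFS = FS = "\t" }
# First pass: count, per column, how often the value v occurs.
FNR == NR {
        for (i = 2; i <= NF; ++i)
                if ($i == v) ++c[i]
        next
}
# Second pass: print column 1 plus every column whose count reaches t.
{
        line = $1
        for (i = 2; i <= NF; ++i)
                if (c[i] >= t) line = line OFS $i
        print line
}' "$data" "$data")

printf '%s\n' "$result"
rm -f "$data"
```

With v=2 and t=2 this should select the same M2 and M3 columns as the t=2 run shown above.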
