Lum – `cut`: selecting columns containing a string

columns, cut, regular expression, terminal, text processing

I have a big file with several columns on each line. I'm familiar with using cut -f -d to select specific columns by their number.

I checked the manual for cut and it doesn't seem that there's a way to regex match columns.

What I want to do specifically is:

select the 2nd column of every line
and also select every column that contains the string "hello" (there may be none; if there are any, they can be in any column(s), and not necessarily the same column(s) on each line)

What are the most convenient terminal tools for this operation?

EDIT:

Simplified example

x ID23 a b c hello1
x ID47 hello2 a b c
x ID49 hello3 a b hello4
x ID53 a b c d

The result I would want is:

ID23 hello1
ID47 hello2
ID49 hello3 hello4

or alternatively:

ID23 hello1
ID47 hello2
ID49 hello3 hello4
ID53

To elaborate the example given:

Columns are defined by one space
whether to print only the lines where the string is present is not really important; I can just grep for "hello" if necessary
we can assume the string "hello" will never be in column 1 or 2.

Best Answer

If one space at the end of the line doesn't hurt you much:

$ awk '{for(i=1;i<=NF;i++) if(i==2 || $i~"hello") printf "%s ", $i; print ""}' file

ID23 hello1 
ID47 hello2 
ID49 hello3 hello4 
ID53

This doesn't assume anything about the position of the "hello" string.
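
If the trailing space does matter, one possible variant is sketched below. It is not part of the original answer: it collects the selected fields into a string and prints that string once per line, keeping the fields in their original order. The function name `select_cols` is an assumption made for illustration; the sample lines come from the question.

```shell
# select_cols is a hypothetical helper name; it reads files named as arguments, or stdin.
select_cols() {
    awk '{
        out = ""; sep = ""
        for (i = 1; i <= NF; i++)
            if (i == 2 || $i ~ /hello/) {   # column 2, or any field containing "hello"
                out = out sep $i
                sep = " "                   # separator is only added between fields
            }
        print out
    }' "$@"
}

# Sample data taken from the question
printf '%s\n' 'x ID23 a b c hello1' 'x ID47 hello2 a b c' \
              'x ID49 hello3 a b hello4' 'x ID53 a b c d' | select_cols
```

On the sample data this prints the four lines of the second form shown in the question, with no trailing spaces. Piping the result through `grep hello` drops the `ID53` line and gives the first form.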

Related Solutions

Print columns that start with a specific string

With awk:

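awk '{a=""; for(i=5;i<=NF;i++){if($i~/^ANC=/){a=$i}} print $1,$2,$3,$4,a}' file

- a="" clears the variable at the start of each line, so a match from an earlier line is not printed again when the current line has no ANC= field.
- for(...) loops through all fields, starting with field 5 (i=5).
- if($i~/^ANC=/) checks whether the field starts with ANC=.
- a=$i: if it does, the variable a is set to that field.
- print $1,$2,$3,$4,a prints fields 1-4, followed by whatever is stored in a.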

Can be combined with BEGIN {OFS="\t"} of course.
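
As a rough sketch of that combination, the following wraps the one-liner in a shell function and sets OFS in a BEGIN block so the output is tab-separated. The function name `print_anc` and the sample lines are made up for illustration; only the `ANC=` prefix comes from the answer. The `a = ""` reset keeps a value from a previous line from carrying over.

```shell
# print_anc is a hypothetical helper name; it reads files named as arguments, or stdin.
print_anc() {
    awk 'BEGIN { OFS = "\t" }              # join output fields with a tab
    {
        a = ""                             # reset so a match from an earlier line is not reused
        for (i = 5; i <= NF; i++)
            if ($i ~ /^ANC=/) a = $i       # remember the field that starts with ANC=
        print $1, $2, $3, $4, a
    }' "$@"
}

# Made-up sample lines, for illustration only
printf '%s\n' 'chr1 100 A G DP=10 ANC=T' 'chr1 200 C T ANC=G DP=7' | print_anc
```

Each output line then holds fields 1-4 and the ANC= field, separated by tabs. A line without an ANC= field ends with an empty fifth column.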

Shell – Replace each unique value in all columns with a unique identifier

Here's a very simple approach. Works fine for me, using gawk 3.1.7.

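#!/usr/bin/awk -f
# Replace each distinct value in columns 2..NF with a per-column integer ID.
# The value-to-ID mapping for column N is also written to the file "columnN".
{
    for (x = 2; x <= NF; x++) {
        if ((x, $x) in a) {
            # Value already seen in this column: reuse its ID.
            $x = a[x, $x]
        } else if ($x == "NA") {
            # "NA" always maps to 0.
            print $x, 0 > ("column" x)
            a[x, $x] = 0
            $x = "0"
        } else {
            # New value: assign the next ID for this column.
            m[x]++
            print $x, m[x] > ("column" x)
            a[x, $x] = m[x]
            $x = m[x]
        }
    }
    print $0 > "results"
}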
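
To see the effect of the mapping, here is a condensed sketch of the same idea that prints to standard output instead of writing the `columnN` and `results` files. The function name `number_values` and the sample rows are assumptions made for illustration, not part of the original answer.

```shell
# number_values is a hypothetical helper name; it reads files named as arguments, or stdin.
number_values() {
    awk '{
        for (x = 2; x <= NF; x++) {
            # First time this value appears in column x: give it an ID.
            # "NA" maps to 0, anything else gets the next counter value for the column.
            if (!((x, $x) in id))
                id[x, $x] = ($x == "NA") ? 0 : ++count[x]
            $x = id[x, $x]
        }
        print
    }' "$@"
}

# Made-up sample rows, for illustration only
printf '%s\n' 'r1 red NA' 'r2 blue big' 'r3 red big' | number_values
```

For these rows the output is `r1 1 0`, `r2 2 1` and `r3 1 1`: IDs are counted separately in each column, and column 1 is left untouched.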
