Print columns that start with a specific string

awk  sed  text-processing

I have a file that looks something like this:

ID101     G    T     freq=.5     nonetype     ANC=.1     addinfor
ID102     A    T     freq=.3     ANC=.01    addinfor
ID102     A    T     freq=.01     type=1     ALT=0.022    ANC=.02    addinfor

As you can see, the number of columns varies from line to line. I want columns 1 through 4, plus the column that starts with ANC=, wherever it appears.

Desired output:

ID101     G    T     freq=.5     ANC=.1
ID102     A    T     freq=.3     ANC=.01
ID102     A    T     freq=.01    ANC=.02

I generally use an awk command to parse files:

awk 'BEGIN {OFS = "\t"} {print $1, $2, $3, $4}'

Is there an easy way to alter this command to work for situations like this?

I think something like this might work:

awk '{for(j=1;j<=NF;j++){if($j~/^ANC=/){print $j}}}'

However, how can I edit this to also print the first four columns?

Best Answer

With awk:

awk '{a=""; for(i=5;i<=NF;i++){if($i~/^ANC=/){a=$i}} print $1,$2,$3,$4,a}' file

  • a="" clears the variable for each line, so a line without an ANC= field does not reuse the value from an earlier line.
  • for(...) loops through the fields, starting with field 5 (i=5).
    • if($i~/^ANC=/) checks whether the field starts with ANC=
    • a=$i if it does, stores that field in the variable a
  • print $1,$2,$3,$4,a prints fields 1-4 followed by whatever is stored in a.

This can be combined with BEGIN {OFS="\t"} to get tab-separated output.
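As a rough sketch of that combination, the command could be wrapped in a small shell function and run against the sample data from the question. The function name pick_anc is made up for illustration. The sketch clears a at the start of every line, so a line with no ANC= field prints an empty fifth column instead of a stale value.

```shell
# pick_anc is a hypothetical helper name used only for this sketch.
# It reads the files given as arguments, or standard input if there are none.
pick_anc() {
  awk 'BEGIN { OFS = "\t" }       # join output fields with a tab
  {
    a = ""                        # clear the value left over from the previous line
    for (i = 5; i <= NF; i++) {   # scan the fields after the four fixed columns
      if ($i ~ /^ANC=/) a = $i    # remember the field that starts with ANC=
    }
    print $1, $2, $3, $4, a
  }' "$@"
}

# Sample input copied from the question.
pick_anc <<'EOF'
ID101     G    T     freq=.5     nonetype     ANC=.1     addinfor
ID102     A    T     freq=.3     ANC=.01    addinfor
ID102     A    T     freq=.01     type=1     ALT=0.022    ANC=.02    addinfor
EOF
```

If a line contained more than one ANC= field, this version would keep the last one; adding a break after the assignment would keep the first instead.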
