Keep only certain part of a string in a certain column

awkgrepsedstring

I have a file like this:

id  target_id                               length  eff_length
1   intron_FBgn0000721:20_FBgn0000721:18    1136    243.944268
1   intron_FBgn0000721:19_FBgn0000721:18    1122    240.237419
2   intron_FBgn0264373:2_FBgn0264373:3      56      0
3   intron_FBgn0027570:4_FBgn0027570:3      54      0

For the 2nd column target_id, I want to only keep the string (not always FBgnXXXX, sometimes other names) between intron_and the first :. So the new output file will have the simpler value for column 2 but the rest of the file remains the same.

I tried with sed command but don't know how to delete the part I don't need.

Best Answer

Using sed and column:

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/' file | column -t
id  target_id    length  eff_length
1   FBgn0000721  1136    243.944268
1   FBgn0000721  1122    240.237419
2   FBgn0264373  56      0

The key part of this is the substitute command:

s/ intron_([^:]*):\S*/ \1/

It looks for intron_ and saves everything after intron_ and before the first colon into the variable 1. [^[:space:]]* matches everything from that colon to the end of the field. All of that gets replaced by the text saved in variable 1.

Using awk with tab-separated output:

$ awk -v "OFS=\t" '{$2=$2;sub(/intron_/, "", $2); sub(/:.*/, "", $2); print}' file
id      target_id       length  eff_length
1       FBgn0000721     1136    243.944268
1       FBgn0000721     1122    240.237419
2       FBgn0264373     56      0

Explanation:

  • -v "OFS=\t"

    This sets the output field separator to a tab. This helps line up the columns, possibly making column unnecessary.

  • $2=$2

    When printing a line, awk won't change to our newly-specified output field separator unless we change something on the line. Assigning the second field to the second field is sufficient to assure that the output will have tabs.

  • sub(/intron_/, "", $2)

    This removes intron_ from the second field.

  • sub(/:.*/, "", $2)

    This removes everything after the first colon from the second field.

  • print

    This prints our new line.

Using awk with custom column formatting

This is like the above but uses printf so that we can custom-format column widths and alignments as desired:

$ awk  '{sub(/intron_/, "", $2); sub(/:.*/, "", $2); printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4}' file
id  target_id      length eff_length
1   FBgn0000721      1136 243.944268
1   FBgn0000721      1122 240.237419
2   FBgn0264373        56   0

Here the statement printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4 selects column widths and alignments in the usual printf style.

Using sed and converting from tab-separated to comma-separated

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/; s/[[:space:]][[:space:]]*/,/g' file 
id,target_id,length,eff_length
1,FBgn0000721,1136,243.944268
1,FBgn0000721,1122,240.237419
2,FBgn0264373,56,0
Related Question