I have a file like this:
id target_id length eff_length
1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268
1 intron_FBgn0000721:19_FBgn0000721:18 1122 240.237419
2 intron_FBgn0264373:2_FBgn0264373:3 56 0
3 intron_FBgn0027570:4_FBgn0027570:3 54 0
For the 2nd column, target_id, I want to keep only the string (not always FBgnXXXX, sometimes other names) between intron_ and the first colon (:). So the new output file will have the simpler value for column 2, but the rest of the file remains the same.
I tried using the sed command but don't know how to delete the part I don't need.
Best Answer
Using sed and column:

The key part of this is the substitute command.
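A minimal sketch of a pipeline matching this description, not necessarily the exact command from the answer. It assumes a sed that accepts -E for extended regular expressions, and it writes the sample rows from the question to a temporary file purely for illustration (the names input and result are hypothetical):

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file for illustration.
input=$(mktemp)
cat > "$input" <<'EOF'
id target_id length eff_length
1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268
1 intron_FBgn0000721:19_FBgn0000721:18 1122 240.237419
2 intron_FBgn0264373:2_FBgn0264373:3 56 0
3 intron_FBgn0027570:4_FBgn0027570:3 54 0
EOF

# intron_          literal prefix to drop
# ([^:]*)          group 1: everything up to the first colon
# [^[:space:]]*    the rest of the field, up to the next whitespace
result=$(sed -E 's/intron_([^:]*)[^[:space:]]*/\1/' "$input")

# column -t realigns the whitespace-separated fields; fall back to the raw
# output on systems where column is not installed.
if command -v column >/dev/null 2>&1; then
    printf '%s\n' "$result" | column -t
else
    printf '%s\n' "$result"
fi

rm -f "$input"
```

With this input, the first data row becomes 1 FBgn0000721 1136 243.944268 before column realigns it.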
The substitute command looks for intron_ and saves everything after intron_ and before the first colon into capture group 1. The pattern [^[:space:]]* then matches everything from that colon to the end of the field. All of that gets replaced by the text saved in group 1, which the replacement refers to as \1.

Using awk with tab-separated output:

Explanation:
-v "OFS=\t"
This sets the output field separator to a tab. This helps line up the columns, possibly making column unnecessary.

$2=$2
When printing a line, awk won't switch to our newly specified output field separator unless we change something on the line. Assigning the second field to itself is sufficient to ensure that the output will have tabs.

sub(/intron_/, "", $2)
This removes intron_ from the second field.

sub(/:.*/, "", $2)
This removes the first colon and everything after it from the second field.

print
This prints our new line.
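Put together, the pieces described above might look like the following sketch. The sample rows come from the question; the temporary file and the names input and result are hypothetical, used only to make the example self-contained:

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file for illustration.
input=$(mktemp)
cat > "$input" <<'EOF'
id target_id length eff_length
1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268
1 intron_FBgn0000721:19_FBgn0000721:18 1122 240.237419
2 intron_FBgn0264373:2_FBgn0264373:3 56 0
3 intron_FBgn0027570:4_FBgn0027570:3 54 0
EOF

result=$(awk -v "OFS=\t" '{
    $2 = $2                  # force awk to rebuild the line with OFS
    sub(/intron_/, "", $2)   # drop the intron_ prefix
    sub(/:.*/, "", $2)       # drop the first colon and everything after it
    print
}' "$input")

printf '%s\n' "$result"

rm -f "$input"
```

Because every line is rebuilt with OFS, the header row also comes out tab-separated, even though neither sub call matches anything in it.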
Using awk with custom column formatting

This is like the above but uses printf so that we can custom-format column widths and alignments as desired. Here the statement printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4 selects column widths and alignments in the usual printf style.
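A sketch of how that printf statement could be combined with the two sub calls from the tab-separated approach. The widths are the ones quoted above; the temporary file and the names input and result are hypothetical:

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file for illustration.
input=$(mktemp)
cat > "$input" <<'EOF'
id target_id length eff_length
1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268
1 intron_FBgn0000721:19_FBgn0000721:18 1122 240.237419
2 intron_FBgn0264373:2_FBgn0264373:3 56 0
3 intron_FBgn0027570:4_FBgn0027570:3 54 0
EOF

result=$(awk '{
    sub(/intron_/, "", $2)   # drop the intron_ prefix
    sub(/:.*/, "", $2)       # drop the first colon and everything after it
    # %-Ns left-aligns in at least N characters, %Ns right-aligns.
    printf "%-3s %-12s %8s %3s\n", $1, $2, $3, $4
}' "$input")

printf '%s\n' "$result"

rm -f "$input"
```

The widths are minimums, so a value wider than its column, such as 243.944268 in the 3-character last field, is printed in full rather than truncated.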
Using sed and converting from tab-separated to comma-separated