Using sed and column:
$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/' file | column -t
id target_id length eff_length
1 FBgn0000721 1136 243.944268
1 FBgn0000721 1122 240.237419
2 FBgn0264373 56 0
The key part of this is the substitute command:
s/ intron_([^:]*):[^[:space:]]*/ \1/
It looks for intron_ preceded by a space and saves everything after intron_ and before the first colon in capture group 1. [^[:space:]]* then matches everything from that colon to the end of the field. The whole match is replaced by a space followed by the captured text, \1.
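To see the substitution in isolation, here is a minimal sketch that runs it on a single line. The sample line is an assumption about what the input looks like (the original file is not shown here), so treat the field contents as illustrative only.

```shell
# Hypothetical input line; the real file's second field may look different.
line='1 intron_FBgn0000721:1_FBgn0000721:2 1136 243.944268'

# Keep only the text between "intron_" and the first colon in that field.
result=$(printf '%s\n' "$line" | sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/')

printf '%s\n' "$result"
# prints: 1 FBgn0000721 1136 243.944268
```

The header line contains no intron_ text, so the substitution leaves it untouched.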
Using awk with tab-separated output:
$ awk -v "OFS=\t" '{$2=$2;sub(/intron_/, "", $2); sub(/:.*/, "", $2); print}' file
id target_id length eff_length
1 FBgn0000721 1136 243.944268
1 FBgn0000721 1122 240.237419
2 FBgn0264373 56 0
Explanation:
-v "OFS=\t"
This sets the output field separator to a tab. This helps line up the columns, possibly making column unnecessary.
$2=$2
awk only rebuilds a line with the newly specified output field separator when one of its fields is modified. Assigning the second field to itself is enough to ensure that the output is tab-separated.
sub(/intron_/, "", $2)
This removes intron_ from the second field.
sub(/:.*/, "", $2)
This removes the first colon and everything after it from the second field.
print
This prints the modified line.
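The effect of the self-assignment can be checked with a small sketch. The three-field sample line is made up for illustration; the only thing that differs between the two commands is the $2=$2 assignment.

```shell
# Hypothetical three-field line, separated by single spaces.
line='a b c'

# Without touching a field, awk prints the line exactly as it was read.
unchanged=$(printf '%s\n' "$line" | awk -v 'OFS=\t' '{ print }')

# Assigning a field to itself makes awk rebuild the line using OFS.
rebuilt=$(printf '%s\n' "$line" | awk -v 'OFS=\t' '{ $2 = $2; print }')

printf '%s\n' "$unchanged"   # still space-separated
printf '%s\n' "$rebuilt"     # now tab-separated
```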
Using awk with custom column formatting
This is like the above but uses printf so that we can custom-format column widths and alignments as desired:
$ awk '{sub(/intron_/, "", $2); sub(/:.*/, "", $2); printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4}' file
id target_id length eff_length
1 FBgn0000721 1136 243.944268
1 FBgn0000721 1122 240.237419
2 FBgn0264373 56 0
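Here the statement printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4 selects column widths and alignments in the usual printf style: the number is the minimum field width, and a leading minus sign left-aligns the value instead of right-aligning it.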
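As a rough illustration of those format specifiers, the sketch below prints two values inside square brackets so the padding is visible. The brackets and the chosen values are only there for demonstration.

```shell
# %-12s pads on the right (left-aligned); %8s pads on the left (right-aligned).
formatted=$(awk 'BEGIN { printf "[%-12s][%8s]\n", "FBgn0000721", "1136" }')

printf '%s\n' "$formatted"
# prints: [FBgn0000721 ][    1136]
```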
Using sed and converting from tab-separated to comma-separated
$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/; s/[[:space:]][[:space:]]*/,/g' file
id,target_id,length,eff_length
1,FBgn0000721,1136,243.944268
1,FBgn0000721,1122,240.237419
2,FBgn0264373,56,0
To sort, you can also use a pipe inside an awk command, as in:
awk '{ print ... | "sort ..." }'
With this syntax, every line printed from the data file is passed to the same instance of sort.
Of course, you can do the equivalent at the shell level:
awk '{ print ... }' | sort ...
Or you can use GNU awk, which has a couple of sort functions built in (asort and asorti).
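The first two variants can be compared with a short sketch. The sample data and the choice of sort -n are assumptions made for the example; both commands should produce the same sorted output.

```shell
# Hypothetical two-column data, deliberately out of order.
data=$(printf '%s\n' 'b 2' 'a 1' 'c 3')

# Pipe inside awk: every print goes to one shared "sort -n" process,
# which emits its output when awk closes the pipe at exit.
inside=$(printf '%s\n' "$data" | awk '{ print $2, $1 | "sort -n" }')

# Equivalent pipeline at the shell level.
outside=$(printf '%s\n' "$data" | awk '{ print $2, $1 }' | sort -n)

printf '%s\n' "$inside"
```

The in-awk form is mainly useful when only some of the output should be sorted, or when different records should go to different commands.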
In awk, the job of uniq is typically accomplished by saving each unique data element or key in an associative array and checking whether a new element has already been seen. One example to illustrate:
awk '!a[$0]++'
This means: if the current line is not yet in the array, the condition is true and the default action, printing the line, is triggered. The post-increment then records the line, so subsequent lines with the same data make the condition false and are not printed.
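A minimal sketch of this idiom on made-up input follows. The array is named seen here purely for readability. Unlike uniq, this does not require the input to be sorted, and it keeps the first occurrence of each line in its original position.

```shell
# Hypothetical input with repeated, non-adjacent lines.
deduped=$(printf '%s\n' x y x z y | awk '!seen[$0]++')

printf '%s\n' "$deduped"
# prints x, y, z on separate lines
```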
Best Answer
One way to do it: