Keep only certain part of a string in a certain column

awkgrepsedstring

I have a file like this:

id  target_id                               length  eff_length
1   intron_FBgn0000721:20_FBgn0000721:18    1136    243.944268
1   intron_FBgn0000721:19_FBgn0000721:18    1122    240.237419
2   intron_FBgn0264373:2_FBgn0264373:3      56      0
3   intron_FBgn0027570:4_FBgn0027570:3      54      0

For the 2nd column target_id, I want to only keep the string (not always FBgnXXXX, sometimes other names) between intron_and the first :. So the new output file will have the simpler value for column 2 but the rest of the file remains the same.

I tried with sed command but don't know how to delete the part I don't need.

Best Answer

Using `sed` and `column`:

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/' file | column -t
id  target_id    length  eff_length
1   FBgn0000721  1136    243.944268
1   FBgn0000721  1122    240.237419
2   FBgn0264373  56      0

The key part of this is the substitute command:

s/ intron_([^:]*):\S*/ \1/

It looks for intron_ and saves everything after intron_ and before the first colon into the variable 1. [^[:space:]]* matches everything from that colon to the end of the field. All of that gets replaced by the text saved in variable 1.

Using `awk` with tab-separated output:

$ awk -v "OFS=\t" '{$2=$2;sub(/intron_/, "", $2); sub(/:.*/, "", $2); print}' file
id      target_id       length  eff_length
1       FBgn0000721     1136    243.944268
1       FBgn0000721     1122    240.237419
2       FBgn0264373     56      0

Explanation:

-v "OFS=\t"

This sets the output field separator to a tab. This helps line up the columns, possibly making column unnecessary.
$2=$2

When printing a line, awk won't change to our newly-specified output field separator unless we change something on the line. Assigning the second field to the second field is sufficient to assure that the output will have tabs.
sub(/intron_/, "", $2)

This removes intron_ from the second field.
sub(/:.*/, "", $2)

This removes everything after the first colon from the second field.
print

This prints our new line.

Using `awk` with custom column formatting

This is like the above but uses printf so that we can custom-format column widths and alignments as desired:

$ awk  '{sub(/intron_/, "", $2); sub(/:.*/, "", $2); printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4}' file
id  target_id      length eff_length
1   FBgn0000721      1136 243.944268
1   FBgn0000721      1122 240.237419
2   FBgn0264373        56   0

Here the statement printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4 selects column widths and alignments in the usual printf style.

Using `sed` and converting from tab-separated to comma-separated

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/; s/[[:space:]][[:space:]]*/,/g' file 
id,target_id,length,eff_length
1,FBgn0000721,1136,243.944268
1,FBgn0000721,1122,240.237419
2,FBgn0264373,56,0

Related Solutions

How to replace values in a string using sed but keep the string intact

sed will only replace the matching substring of a string.

$ echo '01XX1234' | sed 's/XX/AF/'
01AF1234

If you intent to always replace the third and forth character but keep the first and second intact you can use:

$ echo '01XX1234' | sed 's/\(.\{2\}\).\{2\}/\1AF/'
01AF1234

\{2\} means two occurrences of the character standing before (in this case .: any character)
$...$ parentheses are used to create a matching group; you can refer to the matched substring later in the pattern or in the replacement string
\1 refers to the first matched substring (in this case i.e. "01")

Awk search and replace string in a specific column of CSV file

awk -F '|' -v OFS='|' '$16 == "Market1" { $16 = "MarketPrime" }1' file.csv >new-file.csv

The only real issue in your code is that you set the input file separator to not just | but to spaces as well. This will make the spaces count as field separators in the data and it will be incredibly hard to figure out what the correct field number is (since some fields contain a variable number of spaces).

You also can not redirect into the same filename as you use to read from. Doing so would cause the shell to first truncate (empty) the output file, and your awk program would have no data to read.

Your code does a regular expression replacement. This is ok, but you need to be aware that if the 16th field happens to be something like Market12 or TheMarket1, it would trigger the substitution due to the missing anchor points. It would be safer to use ^Market1$ as the expression to replace, or to use a string comparison.

The awk command above uses only | as a field separator and then does a string comparison with the 16th field. If that field is Market1, it is set to MarketPrime.

The trailing 1 at the end of the awk code causes every record (modified or not) to be printed.

Best Answer

Using sed and column:

Using awk with tab-separated output:

Using awk with custom column formatting

Using sed and converting from tab-separated to comma-separated