Remove string from a particular field using awk/sed

Tags: awk, bioinformatics, regular expression, sed

I have a file (>80,000 lines) that looks like this:

chr1    GTF2GFF chromosome  1   249213345   .   .   .   ID=chr1;Name=chr1
chr1    GTF2GFF gene    11874   14408   .   +   .   ID=DDX11L1;Note=unknown;Name=DDX11L1
chr1    GTF2GFF exon    11874   12227   .   +   .   Parent=NR_046018_1
chr1    GTF2GFF exon    12613   12721   .   +   .   Parent=NR_046018_1
chr1    GTF2GFF exon    13221   14408   .   +   .   Parent=NR_046018_1
chr1    GTF2GFF gene    14362   29370   .   -   .   ID=WASH7P;Note=unknown;Name=WASH7P
chr1    GTF2GFF exon    14362   14829   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    14970   15038   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    15796   15947   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    16607   16765   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    16858   17055   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    17233   17368   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    17606   17742   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    17915   18061   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    18268   18366   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    24738   24891   .   -   .   Parent=NR_024540
chr1    GTF2GFF exon    29321   29370   .   -   .   Parent=NR_024540
chr1    GTF2GFF gene    34611   36081   .   -   .   ID=FAM138A;Note=unknown;Name=FAM138A
chr1    GTF2GFF exon    34611   35174   .   -   .   Parent=NR_026818
chr1    GTF2GFF exon    35277   35481   .   -   .   Parent=NR_026818

I want to extract only the rows that contain "gene" in the 3rd field and reduce the 9th field to just the ID value (for example, DDX11L1). This is the desired output:

chr1    11874   14408   DDX11L1    .       +
chr1    14362   29370   WASH7P      .       -
chr1    34611   36081   FAM138A    .       -

Using awk I got the desired fields easily:

head -20 genes.gff3 | awk '$3=="gene" {print $1 "\t" $4 "\t" $5 "\t" $9"\t" $6 "\t" $7}'
chr1    11874   14408   ID=DDX11L1;Note=unknown;Name=DDX11L1    .       +
chr1    14362   29370   ID=WASH7P;Note=unknown;Name=WASH7P      .       -
chr1    34611   36081   ID=FAM138A;Note=unknown;Name=FAM138A    .       -

But I am struggling with getting the ID value. I've tried piping it to sed:

head -20 genes.gff3 | awk '$3=="gene" {print $1 "\t" $4 "\t" $5 "\t" $9"\t" $6 "\t" $7}' | sed 's/\(^.+\t\)ID=\(\w+\).+\(\t.+$\)/\1\2\3/g'

and also gsub

head -20 genes.gff3 | awk '$3=="gene" {gsub(/\(^.+\t\)ID=\(\w+\).+\(\t.+$\)/, "\1\2\3", $9); print $1 "\t" $4 "\t" $5 "\t" $9"\t" $6 "\t" $7}' 

But the result is same as using awk alone. How can I extract the ID value? I feel that I am really close to a solution here.

Cheers.

Best Answer

You could split the 9th field on ";" and take the ID value with substr:

split($9, a, ";")
print substr(a[1], 4)

Awk string and array indexes start at 1, so the value that follows the three-character prefix "ID=" begins at position 4.

Another option is to change the input field separator (FS). By default FS is a single space, " ", which is a special case: fields are separated by runs of blanks (spaces, tabs, newlines), and leading and trailing blanks are ignored.
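To see how the default FS differs from a regular expression such as " +", you can count fields on a line that starts with blanks. This is only a small sketch on made-up input (not the GFF file), and the variable names are arbitrary:

```shell
# Default FS: runs of blanks separate fields; leading/trailing blanks are ignored.
default_nf=$(printf '  a \t b  \n' | awk '{ print NF }')   # 2

# Regex FS " +": every match is a separator, so the leading spaces
# produce an empty first field.
regex_nf=$(printf '  a b\n' | awk -F' +' '{ print NF }')    # 3

echo "default FS: $default_nf fields, FS=\" +\": $regex_nf fields"
```

For the data in the question this makes no difference, because no line starts with a blank, but it is worth knowing before changing FS on other files.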

Also, instead of concatenating "\t" between the fields in print, or using the printf variant, you can set OFS to a tab and separate the print arguments with commas.


Examples:

Modifying FS:

awk -F" +|;|=" '

$3 == "gene" {
    printf("%s\t%s\t%s\t%s\t%s\t%s\n",
    $1, $4, $5, $10, $6, $7);
}
' data.file

Using split:

awk '
$3 == "gene" {
    split($9, a, ";")
    printf("%s\t%s\t%s\t%s\t%s\t%s\n",
    $1, $4, $5, substr(a[1], 4), $6, $7);
}
' data.file
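The split approach relies on ID being the first attribute in field 9. If that is not guaranteed, one possible variation is to scan the pieces for the one that starts with "ID=". The sketch below assumes tab-separated input, as is usual for GFF3, and builds two sample rows from the question inline; the names sample, attr, and id are illustrative only:

```shell
# Two rows copied from the question, rewritten with tab separators.
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 GTF2GFF gene 11874 14408 . + . 'ID=DDX11L1;Note=unknown;Name=DDX11L1' \
  chr1 GTF2GFF exon 11874 12227 . + . 'Parent=NR_046018_1')

result=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = OFS = "\t" }
$3 == "gene" {
    id = ""
    n = split($9, attr, ";")          # attr[1..n] hold the key=value pairs
    for (i = 1; i <= n; i++)
        if (attr[i] ~ /^ID=/)         # pick the ID attribute wherever it sits
            id = substr(attr[i], 4)   # drop the 3-character "ID=" prefix
    print $1, $4, $5, id, $6, $7
}')

printf '%s\n' "$result"
```

On the real file, the same awk program would read genes.gff3 directly instead of the inline sample.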

OFS and FS:

Set the output field separator (OFS) to a tab and assign FS inside the awk program. FS here also matches tabs:

awk '
BEGIN {
    FS="[ \t]+|;|="
    OFS="\t"
}
$3 == "gene" {
    print $1, $4, $5, $10, $6, $7
}

' data.file
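As for why the gsub attempt in the question changes nothing: awk's sub and gsub do not support backreferences such as \1 in the replacement (only & for the whole match), and in awk's extended regular expressions \( matches a literal parenthesis, so that pattern never matches. The sed attempt has a related problem: in basic regular expressions + is not a quantifier unless written as \+ (GNU sed) or used with the -E option. A regex-based alternative that stays within standard awk is to delete the unwanted parts with two sub calls. This is a sketch that assumes tab-separated input and that ID is the first attribute; the variable id is just a working copy of $9:

```shell
result=$(printf 'chr1\tGTF2GFF\tgene\t14362\t29370\t.\t-\t.\tID=WASH7P;Note=unknown;Name=WASH7P\n' | awk '
BEGIN { FS = OFS = "\t" }
$3 == "gene" {
    id = $9
    sub(/^ID=/, "", id)   # strip the leading "ID="
    sub(/;.*/, "", id)    # strip everything from the first ";" onward
    print $1, $4, $5, id, $6, $7
}')

printf '%s\n' "$result"
```

gawk additionally offers gensub, which does accept backreferences, but that is a gawk extension rather than standard awk.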

Also see the Open Group awk specification, sections "Variables and Special Variables" and "Examples".

The gawk manual is also useful; it usually notes when a feature is a gawk extension to awk.
