I have a file (>80,000 lines) that looks like this:
chr1 GTF2GFF chromosome 1 249213345 . . . ID=chr1;Name=chr1
chr1 GTF2GFF gene 11874 14408 . + . ID=DDX11L1;Note=unknown;Name=DDX11L1
chr1 GTF2GFF exon 11874 12227 . + . Parent=NR_046018_1
chr1 GTF2GFF exon 12613 12721 . + . Parent=NR_046018_1
chr1 GTF2GFF exon 13221 14408 . + . Parent=NR_046018_1
chr1 GTF2GFF gene 14362 29370 . - . ID=WASH7P;Note=unknown;Name=WASH7P
chr1 GTF2GFF exon 14362 14829 . - . Parent=NR_024540
chr1 GTF2GFF exon 14970 15038 . - . Parent=NR_024540
chr1 GTF2GFF exon 15796 15947 . - . Parent=NR_024540
chr1 GTF2GFF exon 16607 16765 . - . Parent=NR_024540
chr1 GTF2GFF exon 16858 17055 . - . Parent=NR_024540
chr1 GTF2GFF exon 17233 17368 . - . Parent=NR_024540
chr1 GTF2GFF exon 17606 17742 . - . Parent=NR_024540
chr1 GTF2GFF exon 17915 18061 . - . Parent=NR_024540
chr1 GTF2GFF exon 18268 18366 . - . Parent=NR_024540
chr1 GTF2GFF exon 24738 24891 . - . Parent=NR_024540
chr1 GTF2GFF exon 29321 29370 . - . Parent=NR_024540
chr1 GTF2GFF gene 34611 36081 . - . ID=FAM138A;Note=unknown;Name=FAM138A
chr1 GTF2GFF exon 34611 35174 . - . Parent=NR_026818
chr1 GTF2GFF exon 35277 35481 . - . Parent=NR_026818
and I want to extract only the rows that contain "gene" in the 3rd field and re-arrange the 9th field to contain only the ID value (for example, DDX11L1). This is the desired output:
chr1 11874 14408 DDX11L1 . +
chr1 14362 29370 WASH7P . -
chr1 34611 36081 FAM138A . -
Using awk I got the desired fields easily:
head -20 genes.gff3 | awk '$3=="gene" {print $1 "\t" $4 "\t" $5 "\t" $9"\t" $6 "\t" $7}'
chr1 11874 14408 ID=DDX11L1;Note=unknown;Name=DDX11L1 . +
chr1 14362 29370 ID=WASH7P;Note=unknown;Name=WASH7P . -
chr1 34611 36081 ID=FAM138A;Note=unknown;Name=FAM138A . -
But I am struggling with getting the ID value. I've tried piping it to sed:
head -20 genes.gff3 | awk '$3=="gene" {print $1 "\t" $4 "\t" $5 "\t" $9"\t" $6 "\t" $7}' | sed 's/\(^.+\t\)ID=\(\w+\).+\(\t.+$\)/\1\2\3/g'
and also gsub
head -20 genes.gff3 | awk '$3=="gene" {gsub(/\(^.+\t\)ID=\(\w+\).+\(\t.+$\)/, "\1\2\3", $9); print $1 "\t" $4 "\t" $5 "\t" $9"\t" $6 "\t" $7}'
But the result is the same as using awk alone. How can I extract the ID value? I feel that I am really close to a solution here.
Cheers.
Best Answer
You could split the field and use substr to take the part after ID=. Awk indexes start at 1.
Another option could be to modify the input field separator (FS). FS is a single space, " ", by default, which also has the special effect of ignoring leading and trailing spaces.
Also, instead of using print $1 "\t" $4 ... or the printf variant, one could set OFS to tab.
Examples:
Modifying FS:
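A minimal sketch of the FS approach, assuming the input is separated by single spaces as in the listing in the question. The file name sample_fs.gff3 is hypothetical and only holds a few lines for illustration. Splitting on runs of space, ";" or "=" makes the ID value land in field 10:

```shell
# Hypothetical sample file, space-separated as in the listing in the question.
cat > sample_fs.gff3 <<'EOF'
chr1 GTF2GFF gene 11874 14408 . + . ID=DDX11L1;Note=unknown;Name=DDX11L1
chr1 GTF2GFF exon 11874 12227 . + . Parent=NR_046018_1
chr1 GTF2GFF gene 14362 29370 . - . ID=WASH7P;Note=unknown;Name=WASH7P
EOF

# -F sets FS to a regex: one or more of space, ";" or "=".
# Fields 1-8 are unchanged, $9 becomes the literal "ID" and $10 its value.
result=$(awk -F'[ ;=]+' '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $10 "\t" $6 "\t" $7}' sample_fs.gff3)
printf '%s\n' "$result"
```

If the real file is tab-delimited, as GFF3 files normally are, a tab would have to be added to the bracket expression as well.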
Using split:
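A sketch of the split and substr approach, assuming ID= is always the first attribute in the 9th field (true for the gene rows shown in the question, but an assumption for the rest of the file). The file name sample_split.gff3 and the array name attr are made up for the example:

```shell
# Hypothetical tab-separated sample file.
printf 'chr1\tGTF2GFF\tgene\t11874\t14408\t.\t+\t.\tID=DDX11L1;Note=unknown;Name=DDX11L1\n' > sample_split.gff3
printf 'chr1\tGTF2GFF\texon\t11874\t12227\t.\t+\t.\tParent=NR_046018_1\n' >> sample_split.gff3
printf 'chr1\tGTF2GFF\tgene\t34611\t36081\t.\t-\t.\tID=FAM138A;Note=unknown;Name=FAM138A\n' >> sample_split.gff3

result=$(awk '$3 == "gene" {
    split($9, attr, ";")        # attr[1] holds the "ID=..." attribute (assumed to come first)
    id = substr(attr[1], 4)     # indexes are 1-based, so position 4 is the first character after "ID="
    print $1 "\t" $4 "\t" $5 "\t" id "\t" $6 "\t" $7
}' sample_split.gff3)
printf '%s\n' "$result"
```

The default FS splits on both spaces and tabs, so this version does not depend on which of the two the file uses.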
OFS and FS:
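For the OFS and FS variant, one possible sketch sets both variables in a BEGIN block, so that print with commas emits tab-separated output. The file name sample_ofs.gff3 is hypothetical, and the sample is tab-delimited:

```shell
# Hypothetical tab-separated sample file.
printf 'chr1\tGTF2GFF\tgene\t14362\t29370\t.\t-\t.\tID=WASH7P;Note=unknown;Name=WASH7P\n' > sample_ofs.gff3
printf 'chr1\tGTF2GFF\texon\t14362\t14829\t.\t-\t.\tParent=NR_024540\n' >> sample_ofs.gff3

# FS: one or more of space, tab, ";" or "=", so the ID value is field 10.
# OFS: tab, inserted wherever print has a comma between arguments.
result=$(awk 'BEGIN { FS = "[ \t;=]+"; OFS = "\t" }
$3 == "gene" { print $1, $4, $5, $10, $6, $7 }' sample_ofs.gff3)
printf '%s\n' "$result"
```

Setting FS inside BEGIN rather than with -F keeps both separators in one place, which is a matter of taste.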
This variant sets the Output Field Separator (OFS) to tab and sets an alternative FS inside awk, with FS updated to include tab.
Also see The Open Group Variables and Special Variables, Examples.
Gawk manual – it usually is noted when things are a gawk extension to awk.