I have a GFF file with 7 columns describing chromosomal regions. I want to collapse the rows where REGION = "exon" so that exons which overlap each other are merged into a single row.
REGION START END SCORE STRAND FRAME ATTRIBUTE
exon 26453 26644 . + . Transcript "XM_092971"; Name "XM_092971"
exon 26842 27020 . + . Transcript "XM_092971"; Name "XM_092971"
exon 30355 30899 . - . Transcript "XM_104663"; Name "XM_104663"
GS_TRAN 30355 34083 . - . GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
snp 30847 30847 . + . SNP "rs2971719"; Name "rs2971719"
exon 31012 31409 . - . Transcript "XM_104663"; Name "XM_104663"
exon 34013 34083 . - . Transcript "XM_104663"; Name "XM_104663"
exon 40932 41071 . + . Transcript "XM_092971"; Name "XM_092971"
snp 44269 44269 . + . SNP "rs2873227"; Name "rs2873227"
snp 45723 45723 . + . SNP "rs2227095"; Name "rs2227095"
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
exon 134034 134457 . - . Transcript "XM_086914"; Name "XM_086914"
Looking at the sample data above, only the last two rows can be merged into one row. So the new row will become:
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
If the end of the second row had been greater than the end of the first, that would be the END of the merged region. Basically, if there is any overlap, take the START of the region which starts earliest and the END of the one which ends latest.
There can be multiple instances of such rows; here only the last 2 rows overlap. Note that the ATTRIBUTE column will definitely show different Transcript names for such rows, although the names are mostly the same in other cases.
Any suggestions on how to proceed?
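The merge rule described above (earliest start, latest end) can be sketched in Python as follows. This is only a minimal illustration on bare (start, end) pairs; the function name `merge_intervals` is made up for the example, and it assumes inclusive coordinates where an overlap means the next start is not greater than the current end.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs; input need not be sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlap: keep the earlier start, extend to the later end.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


# START/END values of some exon rows from the sample data above.
exons = [(26453, 26644), (26842, 27020), (134031, 134495), (134034, 134457)]
print(merge_intervals(exons))
# prints [(26453, 26644), (26842, 27020), (134031, 134495)]
```

Sorting first means a chain of three or more overlapping rows is folded into one region as well.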
UPDATED EXAMPLE: If the last 2 rows are like this:
exon 134031 134457 . - . Transcript "XM_086913"; Name "XM_086913"
exon 134034 134495 . - . Transcript "XM_086914"; Name "XM_086914"
Then the output should be:
exon 134031 134495 . - . Transcript "XM_086913"; Transcript "XM_086914"
Basically, the START comes from the first row and the END from the second, since I want to cover the overlap in one row only instead of 2, 3 or more rows. Here the overlap is between 2 rows, but it could be between more than 2 rows as well.
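For the 7-column layout above, one possible way to get that output is sketched below in Python. The function name `collapse_exons` is hypothetical. The sketch assumes whitespace-separated columns, ignores strand when testing for overlap, takes SCORE, STRAND and FRAME from the first row of each merged group, and returns the non-exon rows separately rather than interleaving them.

```python
import re


def collapse_exons(lines):
    """Collapse overlapping 'exon' rows; other rows are returned untouched."""
    exons, others = [], []
    for line in lines:
        fields = line.split(None, 6)  # 7th field (ATTRIBUTE) keeps its spaces
        if fields[0] == "exon":
            fields[1], fields[2] = int(fields[1]), int(fields[2])
            exons.append(fields)
        else:
            others.append(line)
    exons.sort(key=lambda f: (f[1], f[2]))  # by START, then END
    merged = []
    for f in exons:
        names = re.findall(r'Transcript "([^"]+)"', f[6])
        if merged and f[1] <= merged[-1]["end"]:
            cur = merged[-1]
            cur["end"] = max(cur["end"], f[2])
            # Remember every distinct Transcript name seen in the group.
            cur["names"] += [n for n in names if n not in cur["names"]]
        else:
            merged.append({"start": f[1], "end": f[2],
                           "rest": f[3:6], "names": names})
    out = []
    for m in merged:
        attr = "; ".join('Transcript "%s"' % n for n in m["names"])
        out.append(" ".join(["exon", str(m["start"]), str(m["end"])]
                            + m["rest"] + [attr]))
    return out, others


rows = [
    'exon 134031 134457 . - . Transcript "XM_086913"; Name "XM_086913"',
    'exon 134034 134495 . - . Transcript "XM_086914"; Name "XM_086914"',
]
merged_rows, other_rows = collapse_exons(rows)
print(merged_rows[0])
# prints: exon 134031 134495 . - . Transcript "XM_086913"; Transcript "XM_086914"
```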
UPDATED EXAMPLE (3/24/2014)
chr1 HAVANA stop_codon 1120520 1120522 . + 0 gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1115077 1115233 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1115414 1115433 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1120520 1121244 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA transcript 1115864 1119307 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA exon 1115864 1116240 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA *exon 1117121 1117195* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA *exon 1117150 1117826* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA exon 1118256 1118427 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA transcript 1190648 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1209046 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1203113 1203372 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA CDS 1203241 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA start_codon 1203370 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA stop_codon 1203238 1203240 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1198726 1198766 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1192588 1192690 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1192372 1192510 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA *exon 1191425 1191505* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA *exon 1190648 1191470* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
The upper portion shows an overlap on the "+" strand, and the lower portion shows an overlap on the "-" strand. The "-" strand rows are listed with decreasing coordinates, so the overlap looks like the one in the last 2 rows. The two portions belong to different genes, so the overlap has to be evaluated per gene: different genes sometimes have overlapping exons too, although this is very rare according to some posts I have read.
The gene information can be extracted from the last column, where it is present as "gene_name".
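A small Python sketch of that extraction is shown here. The helper name `parse_attributes` is made up for illustration, and it assumes the attribute column follows the `key "value";` pattern seen in the rows above (with unquoted values such as `level 2;` also allowed).

```python
import re


def parse_attributes(attribute_field):
    """Return a dict mapping each attribute key to the list of its values."""
    attrs = {}
    pattern = r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;'
    for key, quoted, bare in re.findall(pattern, attribute_field):
        # A key may occur more than once (e.g. "tag"), so collect a list.
        attrs.setdefault(key, []).append(quoted or bare)
    return attrs


column9 = ('gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; '
           'gene_name "TTLL10"; level 2;')
attrs = parse_attributes(column9)
print(attrs["gene_name"][0])      # prints TTLL10
print(attrs["transcript_id"][0])  # prints ENST00000460998.1
```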
The two rows from gene_name = TTLL10 have overlapping exons, so they will be merged in the final output.
chr1 HAVANA *exon 1117121 1117195* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1 HAVANA *exon 1117150 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
The two rows from gene_name = UBE2J2 have overlapping exons.
chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
SAMPLE OUTPUT
The rest of the rows remain the same, and the above rows get merged for each gene.
chr1 HAVANA *exon 1117121 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
If the transcript_ids are different, both transcript ids would be printed, although gene_name will remain the same. For example, if the transcript ids for the gene were different, like below:
chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.2"; gene_name "UBE2J2";
This will merge as above, but the merged row should have both transcript ids, since I believe it will be important later on to preserve the transcript information.
chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; "ENST00000473215.2" gene_name "UBE2J2";
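The per-gene merge requested above could be sketched in Python along these lines. This is a rough illustration under several assumptions, not a finished tool: the function name `merge_exons_per_gene` is hypothetical, every exon row is assumed to carry a gene_name attribute, the highlighting asterisks are assumed to be absent from the real file, non-exon rows are skipped rather than passed through, and the merged transcript ids are written as repeated `transcript_id "..."` entries, which is a guess at a usable output format.

```python
import re
from collections import defaultdict


def merge_exons_per_gene(lines):
    """Merge overlapping exon rows sharing chromosome, strand and gene_name."""
    groups = defaultdict(list)
    for line in lines:
        f = line.split(None, 8)  # 9th field holds all the attributes
        if f[2] != "exon":
            continue  # this sketch only handles exon rows
        gene = re.search(r'gene_name "([^"]+)"', f[8]).group(1)
        tids = re.findall(r'transcript_id "([^"]+)"', f[8])
        groups[(f[0], f[6], gene)].append((int(f[3]), int(f[4]), f, tids))
    out = []
    for (chrom, strand, gene), rows in groups.items():
        # Sort by ascending start, which also handles "-" strand rows
        # that are listed in decreasing order in the file.
        rows.sort(key=lambda r: (r[0], r[1]))
        merged = []
        for start, end, f, tids in rows:
            if merged and start <= merged[-1]["end"]:
                cur = merged[-1]
                cur["end"] = max(cur["end"], end)
                cur["tids"] += [t for t in tids if t not in cur["tids"]]
            else:
                merged.append({"start": start, "end": end,
                               "f": f, "tids": list(tids)})
        for m in merged:
            f = m["f"]
            attr = "".join('transcript_id "%s"; ' % t for t in m["tids"])
            attr += 'gene_name "%s";' % gene
            out.append(" ".join([f[0], f[1], f[2], str(m["start"]),
                                 str(m["end"]), f[5], f[6], f[7], attr]))
    return out
```

The output is joined with single spaces to match the rows as pasted here; a real GFF/GTF file is tab-delimited, so the join character would need to change for actual use.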
Best Answer
An 'awk' approach,