Bash – Merge fields in a file

awk bash bioinformatics sed shell

I have a GFF file with 7 columns describing chromosomal regions. I want to collapse the rows where REGION = "exon" into a single row whenever their regions overlap each other.

REGION  START   END  SCORE STRAND FRAME     ATTRIBUTE
 exon   26453   26644   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   26842   27020   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   30355   30899   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 GS_TRAN    30355   34083   .   -   .   GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
 snp    30847   30847   .   +   .   SNP "rs2971719"; Name "rs2971719"
 exon   31012   31409   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   34013   34083   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   40932   41071   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 snp    44269   44269   .   +   .   SNP "rs2873227"; Name "rs2873227"
 snp    45723   45723   .   +   .   SNP "rs2227095"; Name "rs2227095"
 exon   134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"            
 exon   134034  134457  .   -   .   Transcript "XM_086914"; Name "XM_086914"            

Looking at the sample data above, only the last two rows can be merged into one row. So the new row will become:

exon    134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"            

If the END of the other row had been greater than that of the previous row, that would be the END of the merged region. Basically, if there is any overlap, take the START of the region that starts earlier and the END of the region that ends later.

There can be multiple such instances; here only the last 2 rows overlap. Note that the ATTRIBUTE column will definitely show different Transcript names for such rows, whereas in other cases the names are mostly the same.

Any suggestions on how to proceed?

UPDATED EXAMPLE: If the last 2 rows are like this:

  exon  134031  134457  .   -   .   Transcript "XM_086913"; Name "XM_086913"            
  exon  134034  134495  .   -   .   Transcript "XM_086914"; Name "XM_086914"

Then the output should be:

exon    134031  134495  .   -   .   Transcript "XM_086913"; Transcript "XM_086914"

Basically, the START comes from the first row and the END from the second, since I want to cover the overlap in one row only instead of 2, 3 or more rows. Here the overlap is between 2 rows, but it could be between more than 2 rows as well.

UPDATED EXAMPLE (3/24/2014)

chr1    HAVANA  stop_codon  1120520 1120522 .   +   0   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1115077 1115233 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1115414 1115433 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1120520 1121244 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  transcript  1115864 1119307 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  exon    1115864 1116240 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  *exon   1117121 1117195*    .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  *exon   1117150 1117826*    .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  exon    1118256 1118427 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";

chr1    HAVANA  transcript  1190648 1209229 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1209046 1209229 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1203113 1203372 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  CDS 1203241 1203372 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  start_codon 1203370 1203372 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  stop_codon  1203238 1203240 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1198726 1198766 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1192588 1192690 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1192372 1192510 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  *exon   1191425 1191505*    .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  *exon   1190648 1191470*    .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";

The upper portion shows the overlap on the "+" strand, and the lower portion shows the overlap on the "-" strand. The "-" strand has decreasing regions, so the overlap will look like the last 2 rows. The two portions belong to different genes, so the overlap has to be determined per gene; different genes can sometimes have overlapping exons too, but from what I have read in some posts this is very rare.
The gene information can be extracted from the last column, where it is present as "gene_name".
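
As a small illustration of pulling "gene_name" out of the attribute column, here is a hedged awk sketch. The function name gene_names is hypothetical (it is not part of the original post), and the sketch assumes every attribute of interest is written as gene_name "VALUE" exactly as in the rows above.

```shell
# Hypothetical helper (not from the original post): print the gene_name
# value of every line that carries one. Matching on the whole line ($0)
# avoids assuming whether the columns are separated by tabs or spaces.
gene_names() {
  awk '
    match($0, /gene_name "[^"]*"/) {
      # The match spans: gene_name "VALUE". Skip the 11-character
      # prefix (gene_name, a space, the opening quote) and drop the
      # closing quote to keep only VALUE.
      print substr($0, RSTART + 11, RLENGTH - 12)
    }
  ' "$@"
}
```

Something like `gene_names file | sort -u` would then list each distinct gene once; lines without a gene_name attribute are skipped silently.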

The two rows with gene_name = TTLL10 have overlapping exons, so they will be merged in the final output.

chr1    HAVANA  *exon   1117121 1117195*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10"; 
chr1    HAVANA  *exon   1117150 1117826*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10"; 

The two rows with gene_name = UBE2J2 have overlapping exons.

 chr1   HAVANA  *exon   1191425 1191505*    .   -   .   transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 
  chr1  HAVANA  *exon   1190648 1191470*    .   -   .   transcript_id "ENST00000473215.1";  gene_name "UBE2J2"; 

SAMPLE OUTPUT

The rest of the rows remain the same, and the above rows get merged for each gene.

chr1    HAVANA  *exon   1117121 1117826*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1    HAVANA  *exon   1190648 1191505*    .   -   .   transcript_id "ENST00000473215.1";  gene_name "UBE2J2"; 

If the transcript_ids are different, both transcript IDs should be printed, although gene_name will remain the same. For example, if the transcript IDs for a gene were different, like below:

  chr1  HAVANA  *exon   1191425 1191505*    .   -   .   transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 
  chr1  HAVANA  *exon   1190648 1191470*    .   -   .   transcript_id "ENST00000473215.2";  gene_name "UBE2J2"; 

This will merge as above, but the result should have both transcript names, since I believe preserving the transcript information might be needed and important later on.

  chr1  HAVANA  *exon   1190648 1191505*    .   -   .   transcript_id "ENST00000473215.1"; "ENST00000473215.2"  gene_name "UBE2J2"; 
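
The per-gene rule described above could be sketched in awk roughly as follows. This is only a sketch under stated assumptions, not a definitive solution: the names merge_gtf_exons, attr and emit are hypothetical, the feature type is assumed to be in field 3 with START and END in fields 4 and 5, overlapping exons of a gene are assumed to sit on adjacent rows, and a merged row keeps only the transcript_id values and gene_name in its attribute column.

```shell
# Hypothetical sketch: merge adjacent overlapping exon rows of the same gene.
merge_gtf_exons() {
  awk '
    # Return VALUE from: key "VALUE" in the current line, or "" if absent.
    function attr(key,    re) {
      re = key " \"[^\"]*\""
      if (match($0, re))
        return substr($0, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
      return ""
    }
    # Print the pending exon: rebuilt if it was merged, unchanged otherwise.
    function emit() {
      if (!have) return
      if (merged)
        printf "%s\t%s\texon\t%d\t%d\t%s\t%s\t%s\ttranscript_id %s; gene_name \"%s\";\n", chr, src, start, end, score, strand, frame, ids, gene
      else
        print line
      have = 0
    }
    $3 != "exon" { emit(); print; next }      # other features pass through
    {
      g = attr("gene_name")
      t = "\"" attr("transcript_id") "\""
      # Overlap test works for ascending (+) and descending (-) row order.
      if (have && g == gene && $4 <= end && $5 >= start) {
        if ($4 + 0 < start) start = $4 + 0    # keep the earliest START
        if ($5 + 0 > end)   end   = $5 + 0    # keep the latest END
        if (index(ids, t) == 0) ids = ids "; " t
        merged = 1
        next
      }
      emit()                                   # no overlap: flush, start anew
      have = 1; merged = 0; line = $0
      chr = $1; src = $2; start = $4 + 0; end = $5 + 0
      score = $6; strand = $7; frame = $8; gene = g; ids = t
    }
    END { emit() }
  ' "$@"
}
```

Because the overlap test compares both ends, it does not matter whether the second row starts after the first (as on the "+" strand) or before it (as on the "-" strand). Rows that are not merged are printed exactly as they were read.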

Best Answer

An awk approach:

awk '
  $1!="exon" {                       # If the first field is not equal to "exon"
    if(previous)print previous       # If there is a previous line then print it
    print                            # Print the current line
    previous=start=end=exon_seq=""   # Set all variables to an empty string
    next                             # Move on to the next line in the input file
  }
  {
    if(exon_seq) {                   # If we are in a sequence of lines with "exon" in field 1
      if(start<=$2 && end>=$3)       # If the start value (field 2) of the previous line
                                     # is less than or equal to that of the current line and
                                     # the end value of the previous line is greater than or
                                     # equal to field 3 of the current line, i.e. the current
                                     # region lies entirely within the previous one,
        next                         # then do nothing and read the next line
      else                           # If the current region is not contained in the previous one,
        print previous               # then print the previous line
    }
    else {                           # If we are not already in a sequence of
                                     # "exon" lines, then this is the first one
      exon_seq=1                     # so exon_seq should become 1
    }
    previous=$0; start=$2; end=$3    # `start` becomes field 2, `end` becomes field 3 and
                                     # `previous` becomes the current record (line)
  }
  END{                               # After all lines are processed
    if(previous) print previous      # If there still is a previous line, then print it
  }
' file
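
The script above skips an exon row only when it lies entirely inside the previous one, so the updated example in the question (134031-134457 followed by 134034-134495) would still come out as two rows. A hedged sketch of one way to extend the same idea to partial overlaps is shown below. The names merge_exons and emit are hypothetical, and the sketch assumes the 7-column layout from the first sample, overlapping exons on adjacent rows, and an attribute column that begins with Transcript "...";.

```shell
# Hypothetical sketch: merge adjacent overlapping exon rows (7-column layout).
merge_exons() {
  awk '
    # Print the pending exon: rebuilt if it was merged, unchanged otherwise.
    function emit() {
      if (!have) return
      if (merged)
        printf "exon\t%d\t%d\t%s\t%s\t%s\t%s\n", start, end, score, strand, frame, attrs
      else
        print line
      have = 0
    }
    $1 != "exon" { emit(); print; next }     # non-exon rows pass through
    {
      tr = $7 " " $8                         # e.g. Transcript "XM_086913";
      sub(/;$/, "", tr)                      # drop the trailing semicolon
      if (have && $2 <= end && $3 >= start) {
        if ($2 + 0 < start) start = $2 + 0   # keep the earliest START
        if ($3 + 0 > end)   end   = $3 + 0   # keep the latest END
        if (index(attrs, tr) == 0) attrs = attrs "; " tr
        merged = 1
        next
      }
      emit()                                 # no overlap: flush, start anew
      have = 1; merged = 0; line = $0
      start = $2 + 0; end = $3 + 0
      score = $4; strand = $5; frame = $6; attrs = tr
    }
    END { emit() }
  ' "$@"
}
```

Called as `merge_exons file`, this prints merged rows tab-separated with every distinct Transcript name collected in the last column, and leaves all other rows as they were.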