How to merge the lines of two files that share common headers

awk, sed, text processing

I want to merge two files based on the header lines they have in common.

Here is an example:

File1

>Feature scaffold1
1   100  g
101 200  g
201 300  g
>Feature scaffold2
1   100  g
01  500  g
>Feature scaffold3
10  500  g
>Feature scaffold4
10  300  g

File2

>Feature scaffold1
500 500 r
900 1000    r
>Feature scaffold2
200 300 r
>Feature scaffold3
100 200 r
>Feature scaffold4
500 600 r
>Feature scaffold5
1   1000    r

And here's the kind of output I want:

>Feature scaffold1
1   100 g
101 200 g
201 300 g
500 500 r
900 1000    r
>Feature scaffold2
1   100 g
01  500 g
200 300 r
>Feature scaffold3
10  500 g
100 200 r
>Feature scaffold4
10  300 g
500 600 r
>Feature scaffold5
1   1000    r

I have tried some awk and sed commands without success. How can I do this?

Best Answer

Awk solution:

awk '/^>/{ k=$1 FS $2 }
     NR==FNR{ 
         if (!/^>/) a[k]=(a[k]!="")? a[k] ORS $0: $0; next
     }
     k in a{ 
         print $0 ORS a[k]; delete a[k]; next 
     }1' file1 file2
  • /^>/{ k=$1 FS $2 } - on a header line (i.e. >Feature ...), compose the key k from the 1st ($1) and 2nd ($2) fields
  • NR==FNR{ ... } - while processing the 1st input file (file1):
    • if (!/^>/) a[k]=(a[k]!="")? a[k] ORS $0: $0 - append each non-header line to array a under the current key k, joining lines with ORS
    • next - skip to the next record
  • k in a - in file2, if the current key is also in array a (built from file1):
    • print $0 ORS a[k] - print the file2 header followed by the stored file1 lines
    • delete a[k] - remove the stored entry, so the test fails for the remaining lines of this block
    • next - skip to the next record
  • 1 - print every other file2 line unchanged

The output:

>Feature scaffold1
1   100  g
101 200  g
201 300  g
500 500 r
900 1000    r
>Feature scaffold2
1   100  g
01  500  g
200 300 r
>Feature scaffold3
10  500  g
100 200 r
>Feature scaffold4
10  300  g
500 600 r
>Feature scaffold5
1   1000    r