Ubuntu – How to extract multiple bits of information that appear on different lines within the same text file

command line, extract, text processing

I am trying to extract the sequence ID and cluster number that occur on different lines within the same text file.

The input looks like this:

>Cluster 72
0   319aa, >O311_01007... *
>Cluster 73
0   318aa, >1494_00753... *
1   318aa, >1621_00002... at 99.69%
2   318aa, >1622_00575... at 99.37%
3   318aa, >1633_00422... at 99.37%
4   318aa, >O136_00307... at 99.69%
>Cluster 74
0   318aa, >O139_01028... *
1   318aa, >O142_00961... at 99.69%
>Cluster 75
0   318aa, >O300_00856... *

The desired output is the sequence ID in one column and the corresponding cluster number in the second.

>O311_01007  72
>1494_00753  73
>1621_00002  73
>1622_00575  73
>1633_00422  73
>O136_00307  73
>O139_01028  74
>O142_00961  74
>O300_00856  75

Can anyone help with this?

Best Answer

With awk:

awk -F '[. ]+' 'NF == 2 {id = $2; next} {print $3, id}' input-file
  • we split fields on runs of one or more spaces or periods with -F '[. ]+'
  • with lines of exactly two fields (the >Cluster lines), save the second field, the cluster number, in id and move to the next line
  • with all other lines, print the third field (the sequence ID) followed by the saved cluster number
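
The same idea can be tried on a small sample without creating a file. This is only a sketch under some assumptions: the variable names input, result and cluster are illustrative, and the /^>Cluster/ pattern and the NF >= 3 guard are additions that are not part of the command above. They match the header lines by their text rather than by field count and skip lines that have no third field, such as blank lines.

```shell
# Sample input in the format shown in the question (a subset of the clusters).
input='>Cluster 72
0   319aa, >O311_01007... *
>Cluster 73
0   318aa, >1494_00753... *
1   318aa, >1621_00002... at 99.69%
>Cluster 74
0   318aa, >O139_01028... *'

# Split on runs of spaces or periods, remember the cluster number from each
# header line, and print it next to every sequence ID that follows.
result=$(printf '%s\n' "$input" |
  awk -F '[. ]+' '
    /^>Cluster/ { cluster = $2; next }   # header line: store the cluster number
    NF >= 3     { print $3, cluster }    # member line: sequence ID, then cluster
  ')

printf '%s\n' "$result"
```

For this sample it should print >O311_01007 72, >1494_00753 73, >1621_00002 73 and >O139_01028 74, one pair per line. The columns are separated by a single space; setting OFS (for example with -v OFS='\t') would change the separator if a tab-delimited table is preferred.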