Linux – How to Select Every Two Rows if They Begin with the Same Name

awk linux perl python r

I have a table that looks like this:

     name                             something 
1    100036498|F|0--20:T>G            something
2    100036501|F|0--44:C>T            something     
3    100036501|F|0-44:C>T-44:C>T      something   
4    100036508|F|0--66:T>G            something  
5    100036508|F|0-66:T>G-66:T>G      something  
6    100036511|F|0-19:G>A-19:G>A      something 
7    100036516|F|0--15:T>G            something 
8    100036516|F|0-15:T>G-15:T>G      something 
           ...                         ....

I added the line numbers to make my question easier to follow. Some pairs of rows begin with the same number, such as rows 2 and 3, 4 and 5, and 7 and 8. Other rows are unique, such as rows 1 and 6. I would like to keep only the rows that have a pair, or in other words remove the lines that do not have a pair, to get a table like this one:

     name                             something 
2    100036501|F|0--44:C>T            something     
3    100036501|F|0-44:C>T-44:C>T      something   
4    100036508|F|0--66:T>G            something  
5    100036508|F|0-66:T>G-66:T>G      something   
7    100036516|F|0--15:T>G            something 
8    100036516|F|0-15:T>G-15:T>G      something 
           ...                         ....

I want something like the opposite of the Linux command uniq, taking into account only the number at the start of the first column, not the rest after the | symbol.

Do you know how to do it?

Below is the same first table with the columns separated by one space and without the header, to make it easier to copy.

100036498|F|0--20:T>G something
100036501|F|0--44:C>T something     
100036501|F|0-44:C>T-44:C>T something
100036508|F|0--66:T>G something
100036508|F|0-66:T>G-66:T>G something
100036511|F|0-19:G>A-19:G>A something
100036516|F|0--15:T>G something
100036516|F|0-15:T>G-15:T>G something 

Best Answer

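This is an awk solution. It reads the file twice: the first pass counts how often each leading number (the field before the first |) occurs, and the second pass prints the lines whose number occurs more than once. If you want only the numbers that occur exactly twice, change >1 to ==2.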

awk -F'|' 'NR==FNR{s[$1]++;next} (s[$1]>1)' infile infile

Output:

100036501|F|0--44:C>T            something
100036501|F|0-44:C>T-44:C>T      something
100036508|F|0--66:T>G            something
100036508|F|0-66:T>G-66:T>G      something
100036516|F|0--15:T>G            something
100036516|F|0-15:T>G-15:T>G      something
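
Since the question is also tagged python, the same count-then-filter idea can be sketched in Python. This is only an illustration of the approach used by the awk command, not part of the original answer; the function name keep_repeated and the sample list are made up for the example.

```python
from collections import Counter


def keep_repeated(lines):
    """Return the lines whose leading ID (the text before the first '|')
    occurs more than once, preserving the original order."""
    lines = list(lines)
    # First pass: count each leading ID, like s[$1]++ in the awk command.
    counts = Counter(line.split("|", 1)[0] for line in lines)
    # Second pass: keep lines whose ID was counted more than once.
    # Use == 2 instead of > 1 to keep only IDs that occur exactly twice.
    return [line for line in lines if counts[line.split("|", 1)[0]] > 1]


# Sample data copied from the question (hypothetical variable name).
sample = [
    "100036498|F|0--20:T>G something",
    "100036501|F|0--44:C>T something",
    "100036501|F|0-44:C>T-44:C>T something",
    "100036508|F|0--66:T>G something",
    "100036508|F|0-66:T>G-66:T>G something",
    "100036511|F|0-19:G>A-19:G>A something",
    "100036516|F|0--15:T>G something",
    "100036516|F|0-15:T>G-15:T>G something",
]

for line in keep_repeated(sample):
    print(line)
```

Unlike the awk version, which reads infile twice, this sketch holds all lines in memory, so it assumes the table fits in RAM. To run it on a file, pass the file's lines with the trailing newlines stripped.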