Linux – How to Select Every Two Rows if They Begin with the Same Name

Tags: awk, linux, perl, python, r

I have a table that looks like this:

     name                             something

1    100036498|F|0--20:T>G            something
2    100036501|F|0--44:C>T            something     
3    100036501|F|0-44:C>T-44:C>T      something   
4    100036508|F|0--66:T>G            something  
5    100036508|F|0-66:T>G-66:T>G      something  
6    100036511|F|0-19:G>A-19:G>A      something 
7    100036516|F|0--15:T>G            something 
8    100036516|F|0-15:T>G-15:T>G      something 
           ...                         ....

I added the line numbers to make my question easier to follow. Some pairs of rows begin with the same number, like rows 2 and 3, 4 and 5, and 7 and 8. Other rows are unique, like rows 1 and 6. I would like to keep only the rows that have a pair, or in other words, remove the lines that do not have a pair, to get a table like this one:

     name                             something

2    100036501|F|0--44:C>T            something     
3    100036501|F|0-44:C>T-44:C>T      something   
4    100036508|F|0--66:T>G            something  
5    100036508|F|0-66:T>G-66:T>G      something   
7    100036516|F|0--15:T>G            something 
8    100036516|F|0-15:T>G-15:T>G      something 
           ...                         ....

I want something like the opposite of the Linux command uniq, taking into account only the number at the start of the first column, not the rest after the | symbol.

Do you know how to do it?

Below is the same first table, with the columns separated by one space and without the header, to make it easier to copy.

100036498|F|0--20:T>G something
100036501|F|0--44:C>T something     
100036501|F|0-44:C>T-44:C>T something
100036508|F|0--66:T>G something
100036508|F|0-66:T>G-66:T>G something
100036511|F|0-19:G>A-19:G>A something
100036516|F|0--15:T>G something
100036516|F|0-15:T>G-15:T>G something

Best Answer

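This is an awk solution. It reads the file twice: the first pass counts how often each first field (the number before the first |) occurs, and the second pass prints only the lines whose first field occurs more than once. If you want only the lines whose number occurs exactly twice, change >1 to ==2.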

awk -F'|' 'NR==FNR{s[$1]++;next} (s[$1]>1)' infile infile
100036501|F|0--44:C>T            something
100036501|F|0-44:C>T-44:C>T      something
100036508|F|0--66:T>G            something
100036508|F|0-66:T>G-66:T>G      something
100036516|F|0--15:T>G            something
100036516|F|0-15:T>G-15:T>G      something
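Since the question is also tagged python, here is a minimal sketch of the same two-pass idea in Python. The function name keep_repeated and the min_count parameter are made up for this illustration; the rows are the sample data from the question.

```python
from collections import Counter


def keep_repeated(lines, min_count=2):
    """Keep lines whose key (the text before the first '|') occurs at least min_count times."""
    keys = [line.split("|", 1)[0] for line in lines]
    counts = Counter(keys)  # first pass: count every key
    # second pass: keep matching lines in their original order
    return [line for line, key in zip(lines, keys) if counts[key] >= min_count]


rows = [
    "100036498|F|0--20:T>G something",
    "100036501|F|0--44:C>T something",
    "100036501|F|0-44:C>T-44:C>T something",
    "100036508|F|0--66:T>G something",
    "100036508|F|0-66:T>G-66:T>G something",
    "100036511|F|0-19:G>A-19:G>A something",
    "100036516|F|0--15:T>G something",
    "100036516|F|0-15:T>G-15:T>G something",
]

result = keep_repeated(rows)
for row in result:
    print(row)
```

As with the awk version, the input does not need to be sorted, because the counts are collected before any line is printed.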

Related Solutions

Sed Awk Perl – Match String ‘abcedf’ to ‘bafcde’ in One Line Command

UPDATE #1

Based on the OP's edit the following would do what he wants using a modification of the above approach.

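$ for i in $(<new_vals.txt); do 
  # strip the "_," separators, e.g. 5_,2_,1_,4_ becomes 5214_
  nums=${i//_,/} 

  printf "# to check: [%s]" "$i"
  # 1st grep: runs made only of these digits, "_" and ","
  # 2nd grep: keep only the runs that end in a digit followed by "_"
  k=$(grep -oE "[${nums}_,]+" index.txt | grep "[[:digit:]]_$")
  printf " ==> match: [%s]\n" "$k"

done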

With a modified version of test data:

$ more index.txt new_vals.txt 
::::::::::::::
index.txt
::::::::::::::
1_,2_,4_,5_
0_,2_,3_,9_
::::::::::::::
new_vals.txt
::::::::::::::
5_,2_,1_,4_
2_,5_,1_,4_
1_,1_,1_,1_
1_,2_,4_,4_

Now when we run the above (put inside a script for simplicity, parser.bash):

$ ./parser.bash 
# to check: [5_,2_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [2_,5_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [1_,1_,1_,1_] ==> match: []
# to check: [1_,2_,4_,4_] ==> match: []

How it works

The above method works by exploiting some key characteristics of your data. For example, only matches end with a digit followed by an underscore, so grep "[[:digit:]]_$" picks out just these results.

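The other part of the script, grep -oE "[${nums}_,]+" index.txt, extracts from index.txt every run of characters made up only of the digits in the current string from new_vals.txt, plus _ and ,. A line of index.txt that contains any other digit is broken into shorter runs at that digit, and those runs typically end in a comma, so the second grep drops them.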
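To make the two grep stages easier to follow, here is a rough Python sketch of the same logic. The function name find_match is invented for this illustration, and the re module stands in for grep; it is an approximation of the pipeline, not a replacement for it.

```python
import re


def find_match(candidate, index_lines):
    """Mimic: grep -oE "[${nums}_,]+" index.txt | grep "[[:digit:]]_$"."""
    # Character class built from the candidate's own characters plus "_" and ","
    allowed = set(candidate) | {"_", ","}
    char_class = "[" + "".join(re.escape(c) for c in sorted(allowed)) + "]+"
    matches = []
    for line in index_lines:
        # Stage 1 (grep -oE): every run made only of allowed characters
        for run in re.findall(char_class, line):
            # Stage 2: keep runs that end in a digit followed by "_"
            if re.search(r"[0-9]_$", run):
                matches.append(run)
    return matches


index_lines = ["1_,2_,4_,5_", "0_,2_,3_,9_"]
for candidate in ["5_,2_,1_,4_", "2_,5_,1_,4_", "1_,1_,1_,1_", "1_,2_,4_,4_"]:
    print(candidate, "==>", find_match(candidate, index_lines))
```

With the test data shown earlier, the first two candidates yield 1_,2_,4_,5_ and the last two yield nothing, which agrees with the output of parser.bash.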

Additional adjustments

If the strings may vary in length, the second grep needs to be expanded to guarantee that we only pick out strings of sufficient length. There are several ways to accomplish this: either expand the pattern, or use a counter (perhaps wc or some other means) to confirm that each match has the expected number of elements.

Expanding it like so:

k=$(grep -oE "[${nums}_,]+" index.txt | \
    grep "[[:digit:]]_,[[:digit:]]_,[[:digit:]]_,[[:digit:]]_$")

Would eliminate matches for strings that are too short, like the last one here:

$ ./parser2.bash 
# to check: [5_,2_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [2_,5_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [1_,1_,1_,1_] ==> match: []
# to check: [1_,2_,4_,4_] ==> match: []
# to check: [1_,2_,5_] ==> match: []
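The effect of the stricter second pattern can be sketched in Python as well. The names LOOSE, STRICT, and find_match are made up for this example, and the index data is the test data from above. With the loose pattern, the short string 1_,2_,5_ still produces the partial run _,5_; the strict pattern, which requires four digit-underscore elements, rejects it.

```python
import re

LOOSE = r"[0-9]_$"                        # original second grep
STRICT = r"[0-9]_,[0-9]_,[0-9]_,[0-9]_$"  # expanded second grep


def find_match(candidate, index_lines, tail_pattern):
    """Runs of allowed characters (first grep) filtered by tail_pattern (second grep)."""
    allowed = set(candidate) | {"_", ","}
    char_class = "[" + "".join(re.escape(c) for c in sorted(allowed)) + "]+"
    return [
        run
        for line in index_lines
        for run in re.findall(char_class, line)
        if re.search(tail_pattern, run)
    ]


index_lines = ["1_,2_,4_,5_", "0_,2_,3_,9_"]

loose = find_match("1_,2_,5_", index_lines, LOOSE)
strict = find_match("1_,2_,5_", index_lines, STRICT)
print("loose: ", loose)   # partial run slips through
print("strict:", strict)  # nothing left
```

A full-length candidate such as 5_,2_,1_,4_ still matches 1_,2_,4_,5_ under the strict pattern, so only the too-short candidates are affected.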

AWK – Merge 2 Rows Based on Same Column Values

A perl solution:

$ perl -ane '$h{$F[2]} .= " ".$F[0]." ".$F[1];
    END {
        for $k (sort keys %h) {
            print $_," " for grep {!$seen{$_}++} split(" ",$h{$k});
            print "$k\n";
        }
    }' file

47196436 47723284 name1
42672249 52856963 430695 name2
55094959 380983 name3
17926380 55584836 3213456 34211 54321 name4
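For comparison, the grouping step might look like this in Python. This is only a sketch under assumptions: the input is taken to have three whitespace-separated columns (two values, then a name), which is what the Perl code's $F[0], $F[1], and $F[2] suggest; the sample rows are invented for illustration because the original input file is not shown; and merge_rows is a made-up name.

```python
def merge_rows(lines):
    """Group the first two columns by the third, keeping each value once per name."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        first, second, name = fields[0], fields[1], fields[2]
        values = groups.setdefault(name, [])
        for value in (first, second):
            if value not in values:  # drop repeats, keep first-seen order
                values.append(value)
    # One output line per name, sorted by name like "sort keys %h"
    return [" ".join(values + [name]) for name, values in sorted(groups.items())]


# Hypothetical input, not the original file
sample = [
    "47196436 47723284 name1",
    "42672249 52856963 name2",
    "52856963 430695 name2",
]

merged = merge_rows(sample)
for row in merged:
    print(row)
```

One difference worth noting: this sketch removes duplicates within each name only, whereas the Perl one-liner shares a single %seen hash across all names, so a value already printed for one name would be suppressed for a later one.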
