Sed Awk Perl – Match String ‘abcedf’ to ‘bafcde’ in One Line Command

awk perl sed

I am planning to implement an indexing structure in my program. For example, if I have 100 rows in the table, I will number these rows from 1 to 100 in another column, appending an _ to each number (1_, 2_, 3_, etc.) so that each number can be identified uniquely.

After processing the rows, I am storing the output into a file.

For example, I insert the line 1_,2_,4_,5_ into a file.

If I then get a value such as 5_,2_,1_,4_ or 2_,5_,1_,4_, I should not insert it.

One implementation that comes to mind is to sort the numbers and then compare them. However, if the total number of rows grows to 100,000, that won't be a good solution. Is this possible as a single-line command in Perl, awk, or sed?
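As a rough illustration of the sort-then-compare idea (a sketch, not a tuned solution): only the few fields inside each row are sorted, and rows already seen are looked up in an awk hash, so the work grows roughly linearly with the number of rows. The function name dedupe_sets is hypothetical, and the sketch assumes every field starts with a number, as in 1_,2_,4_,5_.

```shell
# dedupe_sets (hypothetical name): read comma-separated rows on stdin and
# print only the first row seen for each distinct set of fields,
# ignoring the order of the fields within a row.
dedupe_sets() {
  awk '
    {
      n = split($0, f, ",")
      # insertion sort on the numeric value of each field ("12_" + 0 is 12)
      for (i = 2; i <= n; i++) {
        v = f[i]
        for (j = i - 1; j >= 1 && (f[j] + 0) > (v + 0); j--) f[j + 1] = f[j]
        f[j + 1] = v
      }
      # rebuild the row in sorted order and use it as the lookup key
      key = f[1]
      for (i = 2; i <= n; i++) key = key "," f[i]
      if (!(key in seen)) { seen[key] = 1; print }
    }'
}

printf '%s\n' '1_,2_,4_,5_' '5_,2_,1_,4_' '2_,5_,1_,4_' '0_,2_,3_,9_' | dedupe_sets
```

With this input, only 1_,2_,4_,5_ and 0_,2_,3_,9_ are printed; the two reordered rows are dropped.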

EDIT:

To be more precise and short, for a set of unique and distinct values, how can I find all the combinations without repetitions?

Example:

If I have 3 unique keys, 1, 2, and 3, how can I find all combinations without the same combination being repeated?

So for the above example, one combination is 123.

Now, when I search for 213 or 321, it should give me a match, as I have already obtained the combination 123.

Best Answer

You could set up a SQLite database and run SQL selects against it, which would probably be cleaner to implement and more portable later on.

But here's a rough idea. Say I have 2 files:

$ more index.txt new_vals.txt 
::::::::::::::
index.txt
::::::::::::::
1_,2_,4_,5_
::::::::::::::
new_vals.txt
::::::::::::::
5_,2_,1_,4
2_,5_,1_,4

With this command we can match:

$ for i in $(<new_vals.txt); do nums=${i//_,/}; \
        grep -oE "[${nums}_,]+" index.txt; done
1_,2_,4_,5_
1_,2_,4_,5_

This demonstrates that we can match each line from new_vals.txt to an existing line in index.txt.

UPDATE #1

Based on the OP's edit, the following would do what they want, using a modification of the above approach.

$ for i in $(<new_vals.txt); do
  nums=${i//_,/}

  printf "# to check: [%s]" "$i"
  k=$(grep -oE "[${nums}_,]+" index.txt | grep "[[:digit:]]_$")
  printf " ==> match: [%s]\n" "$k"

done

With a modified version of test data:

$ more index.txt new_vals.txt 
::::::::::::::
index.txt
::::::::::::::
1_,2_,4_,5_
0_,2_,3_,9_
::::::::::::::
new_vals.txt
::::::::::::::
5_,2_,1_,4_
2_,5_,1_,4_
1_,1_,1_,1_
1_,2_,4_,4_

Now when we run the above (put inside a script for simplicity, parser.bash):

$ ./parser.bash 
# to check: [5_,2_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [2_,5_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [1_,1_,1_,1_] ==> match: []
# to check: [1_,2_,4_,4_] ==> match: []

How it works

The above method works by exploiting some key characteristics of your data. For example, only matches will end with a digit followed by an underscore. The grep "[[:digit:]]_$" picks out only these results.

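The other part of the script, grep -oE "[${nums}_,]+" index.txt, extracts from index.txt every run of characters made up only of the digits in the current line of new_vals.txt plus _ and ,. A line of index.txt that is a rearrangement of the new line comes back whole; other lines come back only as fragments.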
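To make the two stages concrete, here is a small sketch that wraps them in a helper. The name match_row and the temporary index file are assumptions for illustration; sed is used in place of the bash ${i//_,/} expansion so the sketch also runs in a plain sh.

```shell
# match_row ROW INDEX_FILE (hypothetical helper): run both grep stages for one row
match_row() {
  # same result as bash's ${1//_,/}: "5_,2_,1_,4_" becomes "5214_"
  nums=$(printf '%s' "$1" | sed 's/_,//g')
  # stage 1: print every maximal run made only of those digits, "_" and ","
  # stage 2: keep only the runs that end in a digit followed by "_"
  grep -oE "[${nums}_,]+" "$2" | grep '[[:digit:]]_$'
}

# sample index, mirroring the test data shown above
tmp=$(mktemp -d)
printf '%s\n' '1_,2_,4_,5_' '0_,2_,3_,9_' > "$tmp/index.txt"

match_row '5_,2_,1_,4_' "$tmp/index.txt"            # prints 1_,2_,4_,5_
match_row '1_,1_,1_,1_' "$tmp/index.txt" || true    # prints nothing
```

For the first row, stage 1 alone also emits the fragments _,2_, and _, and _ from the second index line; stage 2 discards them because none ends in a digit followed by an underscore.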

Additional adjustments

If the strings can vary in length, the second grep needs to be expanded to guarantee that we only pick out strings of sufficient length. There are several ways to accomplish this: expand the pattern, or use a counter (perhaps wc or some other means) to confirm that each match has the expected number of fields.

Expanding it like so:

k=$(grep -oE "[${nums}_,]+" index.txt | \
    grep "[[:digit:]]_,[[:digit:]]_,[[:digit:]]_,[[:digit:]]_$")

would rule out shorter strings such as 1_,2_,5_ (the last line below):

$ ./parser2.bash 
# to check: [5_,2_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [2_,5_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [1_,1_,1_,1_] ==> match: []
# to check: [1_,2_,4_,4_] ==> match: []
# to check: [1_,2_,5_] ==> match: []
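The counter idea mentioned above could be sketched as follows: instead of hard-coding four fields in the pattern, count the fields of the row being checked and keep only matches with the same count. The helper name same_length_match is hypothetical, and the sketch assumes single-digit keys as in the sample data.

```shell
# same_length_match ROW INDEX_FILE (hypothetical helper)
same_length_match() {
  # digits of the row, e.g. "1_,2_,5_" becomes "125_"
  nums=$(printf '%s' "$1" | sed 's/_,//g')
  # number of comma-separated fields in the row being checked
  want=$(printf '%s\n' "$1" | awk -F, '{ print NF }')
  grep -oE "[${nums}_,]+" "$2" |
    grep '[[:digit:]]_$' |
    awk -F, -v want="$want" 'NF == want'   # drop runs with a different field count
}

# sample index, mirroring the test data shown above
tmp=$(mktemp -d)
printf '%s\n' '1_,2_,4_,5_' '0_,2_,3_,9_' > "$tmp/index.txt"

same_length_match '2_,5_,1_,4_' "$tmp/index.txt"   # prints 1_,2_,4_,5_
same_length_match '1_,2_,5_' "$tmp/index.txt"      # prints nothing
```

For 1_,2_,5_ the two greps alone would let the fragment _,5_ through; it has two fields rather than three, so the awk filter drops it.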

Related Solutions

How to print incremental count of occurrences of unique values in column 1

The standard trick for this kind of problem in Awk is to use an associative counter array:

awk '{ print $0 "\t" ++count[$1] }'

This counts the number of times the first word in each line has been seen. It's not quite what you're asking for, since

Apple_1   1      300
Apple_2   1      500
Apple_1   500    1500

would produce

Apple_1   1      300     1
Apple_2   1      500     1
Apple_1   500    1500    2

(the count for Apple_1 isn't reset when we see Apple_2), but if the input is sorted you'll be OK.

Otherwise you'd need to track a counter and last-seen key:

awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 "\t" counter }'
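The difference between the two variants shows up on unsorted input. In this small sketch the function names count_total and count_run are made up for illustration, and the sample rows are the three shown above.

```shell
# count_total: running count per key over the whole input (associative array)
count_total() {
  awk '{ print $0 "\t" ++count[$1] }'
}

# count_run: count within the current run of identical keys,
# restarting at 1 whenever the key in column 1 changes
count_run() {
  awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 "\t" counter }'
}

rows=$(printf '%s\n' 'Apple_1 1 300' 'Apple_2 1 500' 'Apple_1 500 1500')

printf '%s\n' "$rows" | count_total   # last line ends with 2
printf '%s\n' "$rows" | count_run     # last line ends with 1
```

On sorted input the two produce the same counts, since every key then forms a single run.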

Bash – Keeping First Instance of Duplicates

sort by itself should suffice. First, sort so that rows are grouped by fields 3-6, with records within each group further ordered by fields 5 and 1. Pipe this to sort -u on fields 3-6; this disables the last-resort comparison and returns the first record from each 3-6 group. Finally, pipe the result to sort again, this time by fields 5 and 1:

sort -k3,6 -k5,5r -k1,1r file | sort -k3,6 -u | sort -k5,5r -k1,1r
A B C D E F G
1 2 T TACA A 3 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q