I have a list of strings in file A and another in file B. I want to take each string in file A and find the most similar string in file B.
For this, I am looking for a tool that provides fuzzy comparison.
For example:
$ fuzzy_compare "Some string" "Some string"
100
where 100 is some similarity ratio, for example one based on the Levenshtein distance.
Is there a utility that does this? I don't want to reinvent the wheel.
Best Answer
I found this page, which provides implementations of the Levenshtein distance algorithm in different languages. So, for example, in bash you could do:
Save that as ~/bin/levenshtein.sh, make it executable (chmod a+x ~/bin/levenshtein.sh) and run it on your two files. For example:

That's fine for a few patterns but will get very slow for larger files. If that's an issue, try one of the implementations in other languages. For example, Perl:
As above, save the script as ~/bin/levenshtein.pl, make it executable and run it with the two files as arguments:

Even in the very small files used here, the Perl approach is 10 times faster than the bash one:
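
If neither script is at hand, the same idea can be sketched in Python. This is not the bash or Perl script referred to above, just a minimal illustration of the approach: compute the Levenshtein distance with the usual two-row dynamic-programming table, then pick, for each line of the first file, the line of the second file with the smallest distance. The function names and the 0-100 score (100 times one minus distance divided by the longer length) are my own assumptions, chosen to match the "100 means equal" example in the question.

```python
#!/usr/bin/env python3
"""Sketch: for each line in file A, print the closest line in file B."""
import sys


def levenshtein(a: str, b: str) -> int:
    """Edit distance using a two-row dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Assumed 0-100 score (not a standard definition): 100 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def best_match(needle: str, candidates: list) -> str:
    """Return the candidate with the smallest distance (first one wins ties)."""
    return min(candidates, key=lambda c: levenshtein(needle, c))


def main(path_a: str, path_b: str) -> None:
    with open(path_a) as fa, open(path_b) as fb:
        lines_a = [line.rstrip("\n") for line in fa]
        lines_b = [line.rstrip("\n") for line in fb]
    if not lines_b:
        return  # nothing to compare against
    for s in lines_a:
        match = best_match(s, lines_b)
        # tab-separated: string from A, closest string from B, score
        print(f"{s}\t{match}\t{similarity(s, match)}")


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a name of your choice (say levenshtein.py), it would be run as python3 levenshtein.py fileA fileB. Note that it still compares every line of A against every line of B, so the running time grows with the product of the two file sizes.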