Linux – tool to measure file difference percentage

command-line, diff, linux

I am looking to compare two text files. Normally, I can just use diff to compare the two files to see the differences. This is great, except that I am more concerned with the percentage difference of the two files.

For example:

File A:
    banana
    TESTING

File B:
    TESTING

In this case, the result would be a 50% difference. I've taken a look at wdiff, and it mostly works, except that it compares the files word by word (in fact, I can get the result above by running wdiff -s filea fileb).

Does a tool exist that reports the percentage difference between two files at the character or byte level?

Best Answer

Doing a character-by-character comparison of two text files is effectively a Levenshtein distance calculation. There isn't a common standalone program in Linux that will do this calculation, but there are library functions for it (PHP has a built-in levenshtein(), for example) and plenty of example code online.

One other caveat is that Levenshtein distance is strictly the number of single-character edits (insertions, deletions, and substitutions) needed to turn one string into the other, so if you're looking for a percentage, you'll need to normalize the calculated distance. Dividing by the mean of the lengths of the two strings (the sizes of the text files) is a commonly used normalization.
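As a rough illustration of the approach described above, here is a minimal Python sketch that computes the Levenshtein distance between two byte strings and normalizes it by the mean of their lengths. The function names levenshtein and percent_difference are made up for this example, and the two-row dynamic-programming layout is just one common way to write the calculation, not the only one.

```python
import sys


def levenshtein(a: bytes, b: bytes) -> int:
    """Edit distance between two byte strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, byte_a in enumerate(a, start=1):
        current = [i]
        for j, byte_b in enumerate(b, start=1):
            cost = 0 if byte_a == byte_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def percent_difference(a: bytes, b: bytes) -> float:
    """Levenshtein distance divided by the mean length, as a percentage."""
    mean_length = (len(a) + len(b)) / 2
    if mean_length == 0:
        return 0.0  # two empty inputs are treated as identical
    return 100.0 * levenshtein(a, b) / mean_length


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python3 pctdiff.py filea fileb
    with open(sys.argv[1], "rb") as fa, open(sys.argv[2], "rb") as fb:
        print(f"{percent_difference(fa.read(), fb.read()):.1f}%")
```

Assuming each line in the question's example ends with a newline, File A is 15 bytes and File B is 8 bytes, the distance is 7, and this sketch reports about 60.9% rather than 50%, because the 50% figure counts lines, not bytes. Note that with mean-length normalization the result can exceed 100% (up to 200% when one file is empty), and the running time grows with the product of the two file sizes, so this is only practical for fairly small files.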
