Compare files and tell how similar they are

Is there a way to compare two files and get some kind of numeric indication of their similarity?

For example, if I have two files that differ by just one character (say, a character was deleted or changed), the program ought to say something like "file X differs by 1 character."

Or if two lines are different, say "file X differs by two lines."

The best output would be something like "File X is 95% similar to file Y"

Best Answer

One approach could be to compute the Levenshtein distance.

Here is an example using the Text::LevenshteinXS Perl module:

distance() {
  perl -MText::LevenshteinXS  -le 'print distance(@ARGV)' "$@"
}

Then:

$ distance foo foo
0
$ distance black blink
2
$ distance "$(cat /etc/passwd)" "$(tr a b < /etc/passwd)"
177
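
To get closer to the "95% similar" output asked for, the distance can be turned into a percentage. A minimal sketch, assuming the distance is normalized by the length of the longer input (the Levenshtein distance can never exceed that length); the helper name percent_similar and its three arguments (distance, first length, second length) are made up for illustration:

```shell
# percent_similar DISTANCE LEN1 LEN2 (hypothetical helper)
# Prints 100 * (1 - DISTANCE / max(LEN1, LEN2)) with one decimal.
percent_similar() {
  awk 'BEGIN {
    # The arguments are read from ARGV; adding 0 forces them to be numbers
    d = ARGV[1] + 0; len1 = ARGV[2] + 0; len2 = ARGV[3] + 0
    longest = (len1 > len2) ? len1 : len2
    # Two empty inputs are treated as identical
    pct = longest ? 100 * (1 - d / longest) : 100
    printf "%.1f\n", pct
  }' "$1" "$2" "$3"
}
```

For example, black and blink are 5 characters each and have a distance of 2, so percent_similar 2 5 5 prints 60.0. Other normalizations (for instance by the sum of both lengths) are possible and give different percentages.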

Here's a line-based implementation of the Levenshtein distance in awk (computes the distance in terms of number of inserted/deleted/modified lines instead of characters):

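awk '
  # Lines of the first file go into s[1..m], lines of the second into t[1..n]
  {if (NR==FNR) s[++m]=$0; else t[++n]=$0}
  function min(x, y) {
    return x < y ? x : y
  }
  END {
    # Make m and n numeric so that d[m,n] is defined even if a file is empty
    m += 0; n += 0
    for(i=0;i<=m;i++) d[i,0] = i
    for(j=0;j<=n;j++) d[0,j] = j

    for(i=1;i<=m;i++) {
      for(j=1;j<=n;j++) {
        # Appending "" forces a string comparison ("1" and "1.0" are different lines)
        c = (s[i] "") != (t[j] "")
        d[i,j] = min(d[i-1,j]+1,min(d[i,j-1]+1,d[i-1,j-1]+c))
      }
    }
    print d[m,n]
  }' file1 file2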
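
The same line-based distance can be reported as a percentage. The following is only a sketch: the function name similarity, the output wording, and the choice to divide by the line count of the longer file are assumptions. It reads both files with getline in a BEGIN block so that empty files (or the same file given twice) are handled; an unreadable file is silently treated as empty.

```shell
# similarity FILE1 FILE2 (hypothetical helper)
# Prints a line-based similarity percentage and the line distance.
similarity() {
  awk '
    function min(x, y) { return x < y ? x : y }
    BEGIN {
      m = 0; n = 0
      # Read the two files explicitly instead of relying on NR==FNR
      while ((getline line < ARGV[1]) > 0) s[++m] = line
      close(ARGV[1])
      while ((getline line < ARGV[2]) > 0) t[++n] = line
      close(ARGV[2])

      for (i = 0; i <= m; i++) d[i,0] = i
      for (j = 0; j <= n; j++) d[0,j] = j
      for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
          # Appending "" forces a string comparison of the two lines
          c = ((s[i] "") != (t[j] ""))
          d[i,j] = min(d[i-1,j] + 1, min(d[i,j-1] + 1, d[i-1,j-1] + c))
        }

      longest = (m > n) ? m : n
      # Two empty files are treated as identical
      pct = longest ? 100 * (1 - d[m,n] / longest) : 100
      printf "%.1f%% similar (line distance %d)\n", pct, d[m,n]
    }' "$1" "$2"
}
```

With two four-line files that differ in a single line, similarity file1 file2 prints "75.0% similar (line distance 1)". Time and memory grow with the product of the two line counts, so this is only practical for fairly small files.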

You may also be interested in diffstat's output:

$ diff -u /etc/passwd <(tr a b < /etc/passwd) | diffstat
 13 |  114 ++++++++++++++++++++++++++++++++++-----------------------------------
 1 file changed, 57 insertions(+), 57 deletions(-)
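
If diffstat is not installed, comparable counts can be approximated from the default diff output, where lines found only in the first file start with "<" and lines found only in the second start with ">". A rough sketch (the function name diffcount and the output wording are made up here):

```shell
# diffcount FILE1 FILE2 (hypothetical helper)
# Counts the lines that diff reports as added or removed.
diffcount() {
  # diff exits with status 1 when the files differ; that is not an error here
  diff "$1" "$2" | awk '
    /^</ { del++ }   # line present only in the first file
    /^>/ { ins++ }   # line present only in the second file
    END  { printf "%d insertions(+), %d deletions(-)\n", ins, del }'
}
```

As with diffstat, a modified line shows up as one deletion plus one insertion, so these counts are not the same thing as the Levenshtein distance.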