Compare files and tell how similar they are

Is there a way to compare two files and get some kind of numeric indication of their similarity?

For example, if I have two files that differ by just one character (say, a character was deleted or changed), the program ought to say something like "file X differs by 1 character."

Or if two lines are different, say "file X differs by two lines."

The best output would be something like "File X is 95% similar to file Y"

Best Answer

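One approach is to compute the Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other.

Here is a shell function that uses the Text::LevenshteinXS Perl module: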

distance() {
  perl -MText::LevenshteinXS  -le 'print distance(@ARGV)' "$@"
}

Then:

$ distance foo foo
0
$ distance black blink
2
$ distance "$(cat /etc/passwd)" "$(tr a b < /etc/passwd)"
177

Here's a line-based implementation of the Levenshtein distance in awk (computes the distance in terms of number of inserted/deleted/modified lines instead of characters):

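awk '
  # Lines of the first file go into s[1..m], lines of the second into t[1..n].
  {if (NR==FNR) s[++m]=$0; else t[++n]=$0}
  function min(x, y) {
    return x < y ? x : y
  }
  END {
    # Turning i lines into nothing (or nothing into j lines) costs i (or j) edits.
    for(i=0;i<=m;i++) d[i,0] = i
    for(j=0;j<=n;j++) d[0,j] = j

    for(i=1;i<=m;i++) {
      for(j=1;j<=n;j++) {
        # Appending "" forces a string comparison, so "1" and "1.0" count as different.
        c = (s[i] "") != (t[j] "")
        d[i,j] = min(d[i-1,j]+1,min(d[i,j-1]+1,d[i-1,j-1]+c))
      }
    }
    # The +0 keeps the subscripts numeric when one of the files is empty.
    print d[m+0,n+0]
  }' file1 file2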
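
To get the percentage the question asks for, the line-based distance can be normalized by the length of the longer file. The following is a minimal sketch, not part of the original answer: the function name similarity, the output wording, and the choice of dividing by max(m, n) are all assumptions.

```shell
# similarity FILE1 FILE2   (hypothetical helper)
# Prints a line-based similarity percentage: 100 * (1 - distance / longest).
similarity() {
  awk '
    function min(x, y) { return x < y ? x : y }
    function max(x, y) { return x > y ? x : y }
    BEGIN {
      m = n = 0
      # Read both files explicitly so that empty files are handled correctly.
      while ((getline line < ARGV[1]) > 0) s[++m] = line
      close(ARGV[1])
      while ((getline line < ARGV[2]) > 0) t[++n] = line

      # Same dynamic-programming table as the line-based distance above.
      for (i = 0; i <= m; i++) d[i,0] = i
      for (j = 0; j <= n; j++) d[0,j] = j
      for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
          # Appending "" forces a string comparison of the two lines.
          c = (s[i] "") != (t[j] "")
          d[i,j] = min(d[i-1,j] + 1, min(d[i,j-1] + 1, d[i-1,j-1] + c))
        }

      longest = max(m, n)
      # Assumption: two empty files count as 100% similar.
      pct = longest ? 100 * (1 - d[m,n] / longest) : 100
      printf "%.1f%% similar (%d of %d lines differ)\n", pct, d[m,n], longest
    }' "$1" "$2"
}
```

For two four-line files that differ in a single line, similarity file1 file2 prints "75.0% similar (1 of 4 lines differ)". The table holds one entry per pair of lines, so this is only practical for reasonably small files.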

You may also be interested in diffstat's output:

$ diff -u /etc/passwd <(tr a b < /etc/passwd) | diffstat
 13 |  114 ++++++++++++++++++++++++++++++++++-----------------------------------
 1 file changed, 57 insertions(+), 57 deletions(-)

Related Solutions

Compare an old file and new file, but ignore lines which only exist in new file

Use join to combine matching lines from the two files. Assuming the file names come after the checksums (as in md5sum output) and don't contain whitespace, this will print all file names that are present in both lists, together with the old checksum and the new checksum:

join -1 2 -2 2 <(sort -k 2 oldlist) <(sort -k 2 newlist)

To also see new files, pass the -a option to join. A bit of output postprocessing will remove the file names for which the checksum has not changed.

join -a 2 -1 2 -2 2 <(sort -k 2 oldlist) <(sort -k 2 newlist) |
awk '$2 != $3'
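
As a concrete illustration, the first pipeline (without -a, so new files are ignored) can be wrapped in a function and tried on small sample lists. This is a sketch under assumptions: the name changed_files is hypothetical, temporary files stand in for the process substitutions so that it also runs in plain sh, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
# changed_files OLDLIST NEWLIST   (hypothetical helper, not from the original answer)
# Prints the names of files present in both lists whose checksum differs.
changed_files() (
  # The ( ... ) body runs in a subshell, so the variables stay local.
  old_sorted=$(mktemp) || exit 1
  new_sorted=$(mktemp) || exit 1
  # join needs both inputs sorted on the join field (field 2, the file name).
  LC_ALL=C sort -k 2 "$1" > "$old_sorted"
  LC_ALL=C sort -k 2 "$2" > "$new_sorted"
  # join prints: name old_checksum new_checksum. Appending "" makes awk
  # compare the checksums as strings rather than as numbers.
  LC_ALL=C join -1 2 -2 2 "$old_sorted" "$new_sorted" |
    awk '($2 "") != ($3 "") { print $1 }'
  rm -f "$old_sorted" "$new_sorted"
)
```

Given an old list containing a.txt and b.txt, and a new list in which b.txt has a different checksum and c.txt has been added, changed_files oldlist newlist prints only b.txt.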

File Comparison – Compare Two Files Strictly Line-by-Line

This could be an approach:

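diff <(nl -ba file1) <(nl -ba file2)

nl prefixes every line with its line number, so diff can only match a line in file1 against the line at the same position in file2. The -ba option numbers blank lines too; by default nl leaves them unnumbered, so the numbers would no longer correspond to line positions.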
