Bash Diff – How to Get the Opposite of `diff -q` for Matching Identical Files

I have a number of files in a directory, and I want to check that they are all unique. For simplicity, let's say I have three files: foo.txt, bar.txt and baz.txt. If I run this loop, I will check them all against each other:

$ for f in ./*; do for i in ./*; do diff -q "$f" "$i"; done; done
Files bar.txt and baz.txt differ
Files bar.txt and foo.txt differ
Files baz.txt and bar.txt differ
Files baz.txt and foo.txt differ
Files foo.txt and bar.txt differ
Files foo.txt and baz.txt differ

For the hundreds of files I actually want to deal with, this output would become pretty unreadable. It would be better to list only the files that do match, so that I can look over the list quickly and make sure that each file matches only itself. From the man page, I would have thought that the -s option would accomplish this:

$ for f in ./*; do for i in ./*; do diff -s "$f" "$i"; done; done
Files bar.txt and bar.txt are identical
Files baz.txt and baz.txt are identical
Files foo.txt and foo.txt are identical

…however, in fact it also prints out the whole contents of any files that differ. Is there any way to suppress this behaviour, so I only get the behaviour above?

Alternatively, is there some other tool that can accomplish this?

Best Answer

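If you just want to check whether two files are identical or not, use cmp. With -s it prints nothing and reports the result only through its exit status (0 when the files are identical). To get output only for identical files, you could use: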

for f in ./*; do for i in ./*; do cmp -s "$f" "$i" && echo "Files $f and $i are identical"; done; done

diff tries to produce a short, human-readable list of the differences, and this can take quite a lot of time, so avoid the overhead if you don't need it.
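The loop above also compares every file with itself and visits each pair twice. As a rough sketch of one way to avoid that (the function name `list_identical_pairs` is made up for this example, and it assumes the files of interest are the regular files directly inside one directory), each unordered pair can be checked once:

```shell
# list_identical_pairs DIR
# Hypothetical helper: prints one line for each pair of identical
# regular files in DIR, skipping self-comparisons and mirrored pairs.
list_identical_pairs() {
    lip_dir=${1:-.}
    for lip_f in "$lip_dir"/*; do
        [ -f "$lip_f" ] || continue
        lip_past=no
        for lip_g in "$lip_dir"/*; do
            # Only compare against files that sort after the current one.
            if [ "$lip_g" = "$lip_f" ]; then
                lip_past=yes
                continue
            fi
            [ "$lip_past" = yes ] || continue
            [ -f "$lip_g" ] || continue
            # cmp -s is silent; exit status 0 means the files are identical.
            if cmp -s "$lip_f" "$lip_g"; then
                echo "Files $lip_f and $lip_g are identical"
            fi
        done
    done
}
```

Calling `list_identical_pairs .` in a directory where every file is unique should print nothing at all, which makes duplicates easy to spot. For very large directories a checksum-based approach may scale better, since this sketch still performs a pairwise comparison.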

Related Solutions

Result of diff two files with switched lines says missing the same line twice

To understand the report, remember that diff is prescriptive, describing what changes need to be made to the first file (file1) to make it the same as the second file (file2).

Specifically, the d in 1d0 means delete and the a in 2a2 means add.

Thus:

1d0 means line 1 of file1 (apples) must be deleted. The 0 is the line of file2 after which the deleted line would have appeared had it not been deleted. Read backwards (changing file2 into file1), it means: append line 1 of file1 after line 0 of file2.
2a2 means line 2 of file2 must be appended after line 2 of file1. The number before the letter always refers to file1 as it originally was, not as renumbered after the earlier deletion.

Diff reports the same line as different in 2 files

My guess is you simply haven't sorted the files. That's one of the behaviors you can get on unsorted input:

$ cat file1 
foo
bar
$ cat file2
bar
foo
$ diff file1 file2
1d0
< foo
2a2
> foo

But, if you sort:

$ diff <(sort file1) <(sort file2)
$

The diff program's job is to tell you whether two files are identical and, if not, where they differ. It is not designed to find similarities between different lines. If line X of one file is not the same as line X of the other, then the files are not the same. It doesn't matter if they contain exactly the same information: if that information is organized in a different way, the files are reported as different.
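If all you need is a yes/no answer to "do these two files contain the same lines, in any order?", the sorting idea can be wrapped in a small function. This is only a sketch: the name `same_lines_any_order` is invented here, and it uses temporary files instead of `<(...)` so that it also works in shells without process substitution.

```shell
# same_lines_any_order FILE1 FILE2
# Hypothetical helper: returns 0 if both files contain the same lines
# (including duplicates) regardless of order, 1 if not, 2 on setup failure.
same_lines_any_order() {
    slo_a=$(mktemp) || return 2
    slo_b=$(mktemp) || { rm -f "$slo_a"; return 2; }
    sort "$1" > "$slo_a"
    sort "$2" > "$slo_b"
    # Compare the sorted copies byte for byte, without printing anything.
    if cmp -s "$slo_a" "$slo_b"; then
        slo_status=0
    else
        slo_status=1
    fi
    rm -f "$slo_a" "$slo_b"
    return "$slo_status"
}
```

With the file1 and file2 shown above, `same_lines_any_order file1 file2 && echo same` should print `same`, even though a plain `diff file1 file2` reports differences.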
