How to Compare Directories with Binary Files

I'd like to compare directories that contain binary files. I'm not interested in what the actual differences between the files are, only in whether there is a difference (and which files differ). Previously I used meld, but it cannot compare binary files.

Which file comparison tool can do this?

NOTE: It doesn't matter whether it's a graphical tool or just a command-line one.

Best Answer

This can easily be done with diff. For example:

$ ls -l foo/
total 2132
-rwxr-xr-x 1 terdon terdon 1029624 Nov 18 13:13 bash
-rwxr-xr-x 1 terdon terdon 1029624 Nov 18 13:13 bash2
-rwxr-xr-x 1 terdon terdon  118280 Nov 18 13:13 ls

$ ls -l bar/
total 1124
-rwxr-xr-x 1 terdon terdon 1029624 Nov 18 13:14 bash
-rwxr-xr-x 1 terdon terdon  118280 Nov 18 13:14 ls

$ diff bar/ foo/
Only in foo/: bash2

In the example above, the foo/ and bar/ directories contain binary files and bash2 is only in foo/.

So, you could run something simple like:

$ diff bar/ foo/ && echo "The directories' contents are identical"

That will list any files that differ or exist in only one of the directories, or print "The directories' contents are identical" if there are none. To compare subdirectories and any files they may contain as well, use diff -r. Combine it with -q so that diff only reports which files differ, without printing the line-by-line differences for text files.
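As a minimal sketch of the recursive, quiet variant described above (the function name compare_dirs is only a placeholder, and a diff that supports -q, such as GNU or BSD diff, is assumed):

```shell
#!/bin/sh
# Report which files differ between two directory trees.
#   -r  recurse into subdirectories
#   -q  only say that files differ, not how they differ
# diff exits with status 0 only when no differences were found.
compare_dirs() {
    if diff -rq "$1" "$2"; then
        echo "The directories' contents are identical"
    fi
}

# Example call (foo/ and bar/ are the directories from the answer above):
# compare_dirs foo/ bar/
```

For files that differ, diff -q prints a line of the form "Files X and Y differ", and files present on one side only are still reported with "Only in ...".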

Related Solutions

A shell-like environment for binary processing

I have had exactly the same problem for years.

For simple non-interactive uses, I like to use the binary block editor BBE. BBE is to binary data what sed is to text, archaic syntax and simplicity included. However, it lacks many features I often need, so I have to combine it with other tools; BBE is only a partial solution. Also note that BBE hasn't had any updates or improvements for years.

Of course one can use xxd before and xxd -r after editing the data with text-based tools, but that won't work when the data in question is large and random access is required, for example when processing block devices.
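For small files, the xxd round trip mentioned above might look roughly like this. The function name hex_patch is made up for illustration, xxd is assumed to be installed, and the whole file is turned into a single line of hex digits, which is exactly why this does not scale to large data:

```shell
#!/bin/sh
# Edit a binary file with a text tool by round-tripping through a hex dump.
# Usage: hex_patch INPUT OUTPUT SED_EXPRESSION
hex_patch() {
    # xxd -p writes a plain hex dump; tr joins the wrapped lines so that a
    # pattern cannot be split by a line break; xxd -r -p converts the
    # edited hex digits back into bytes.
    # Caveat: the pattern is matched on hex digits, so it can also match at
    # an odd digit offset, i.e. across byte boundaries.
    xxd -p "$1" | tr -d '\n' | sed "$3" | xxd -r -p > "$2"
}

# Example call (file names are placeholders):
# hex_patch in.bin out.bin 's/deadbeef/cafebabe/'
```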

(Note: For Windows, there is at least the costly, proprietary WinHex scripting language, but that won't get us anywhere.)

For more complicated binary editing, I usually fall back to Python as well, even though it is sometimes too slow for large files, which is its main drawback. I hope Pyston (Python employing LLVM to compile to optimized machine code) will someday mature enough to be usable or, even better, that someone will design and implement a free, compact, fast and versatile binary processing scripting language, which AFAIK doesn't exist for U*IX-like systems yet.

UPDATE

I also happen to use flat assembler (fasm for short), a homebrew, open-source Intel x86 assembler that has evolved into much more than just an assembler.

It has a powerful, text-block-based macro preprocessor (itself a Turing-complete language) with a syntax in the tradition of the Borland Turbo Assembler macro language, but much more advanced.

It also has a data manipulation language, which lets you include arbitrary files as binary data, perform all kinds of bitwise and arithmetic manipulation on them (integer only) at "compile time", and write the result to an output file. This data manipulation language has control structures and is also Turing complete.

It is much easier to use than writing a program that does some binary manipulation in C, and probably even in Python. Plus, it loads blindingly fast, as it is a small executable with almost no external dependencies (there are two versions: one requires only libc, the other runs as a static executable directly on the Linux kernel ABI).

It does have some rough edges, such as:

it does not support concurrency
it is written in 32-bit x86 assembly (it works on x86_64, though), so you probably need qemu or a similar emulator to run it on anything other than x86 or x86_64
its powerful macro preprocessor language is Turing complete, which means you had better have some experience with languages like Lisp, Haskell, XSLT or, probably the best preparation, M4
all data to be written to the output file is assembled in a "flat" buffer in memory, and this buffer can grow but not shrink until the output file has been written and fasm has terminated; this means that a single run of fasm can only generate files at most as large as the main memory you have left
data can only be written to a single output file per run of fasm
yes, it is homebrew, though a really neat and clever one

How to remove text matching specific patterns from a file

If your timestamps are consistently formatted, you could strip them off (with sed, for example) before processing the files with whatever differencing method you prefer, e.g.

diff <(sed -E 's|[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{2,4} [0-9]{1,} ||' fileA) <(sed -E 's|[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{2,4} [0-9]{1,} ||' fileB)

Testing on your supplied input files:

$ diff \
<(sed -E 's|[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{2,4} [0-9]{1,} ||' fileA) \
<(sed -E 's|[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{2,4} [0-9]{1,} ||' fileB)
2,3c2,3
< abc xxx
< ghi eee ddd
---
> abc def
> ghi fff ddd
