Linux – How to diff large files on Linux

I'm getting a "diff: memory exhausted" error when trying to diff two largely similar 27 GB files on a Linux box running CentOS 5 with 4 GB of RAM. This seems to be a known problem.

I would expect there to be an alternative for such an essential utility, but I can't find one. I imagine the solution would have to use temporary files rather than memory to store the information it needs.

  • I tried rdiff and xdelta, but they are geared toward producing a delta between two files, like a patch, and are not much use for inspecting the differences.
  • I tried VBinDiff, but it is a visual tool better suited to comparing binary files. I need something that can pipe the differences to STDOUT like regular diff.
  • There are a lot of other utilities, such as vimdiff, but they only work with smaller files.
  • I've also read about Solaris bdiff, but I could not find a port for Linux.

Any ideas besides splitting the files into smaller pieces? I have 40 of these files, so I'm trying to avoid the work of breaking them up.

Best Answer

cmp compares files byte by byte, so it probably won't run out of memory (I just tested it on two 7 GB files). However, you might be looking for more detail than a list of "files X and Y differ at byte x, line y". If the similar regions of your files are offset (e.g., file Y has an identical block of text, but not at the same location), you can pass offsets to cmp; with a small script you could probably turn it into a resynchronizing compare.
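As a rough illustration of the "resynchronizing compare" idea, here is a minimal Python sketch. It is not a wrapper around cmp and not a replacement for diff: it is a line-based heuristic that streams both files and, on a mismatch, looks ahead at most `window` lines in each file for the nearest point where they agree again. The names `stream_diff`, `diff_files` and `window` are made up for this example, and it assumes the files are line-oriented with lines short enough that `window` of them fit in memory.

```python
import sys
from collections import deque


def _fill(buf, lines, size):
    """Top up buf from the iterator until it holds `size` lines or input ends."""
    while len(buf) < size:
        line = next(lines, None)
        if line is None:
            break
        buf.append(line)


def stream_diff(lines_a, lines_b, window=1000):
    """Yield ('-', line) for lines only in a and ('+', line) for lines only in b.

    At most `window` lines per input are buffered, so this is a heuristic:
    an inserted or moved block longer than `window` lines is reported as
    unrelated deletions and additions instead of being resynchronized.
    """
    lines_a, lines_b = iter(lines_a), iter(lines_b)
    buf_a, buf_b = deque(), deque()
    while True:
        _fill(buf_a, lines_a, 1)
        _fill(buf_b, lines_b, 1)
        if not buf_a and not buf_b:
            return  # both inputs exhausted
        if buf_a and buf_b and buf_a[0] == buf_b[0]:
            buf_a.popleft()
            buf_b.popleft()
            continue
        # Mismatch (or one input ended): look ahead for the nearest
        # position pair (i, j) where the two inputs agree again.
        _fill(buf_a, lines_a, window)
        _fill(buf_b, lines_b, window)
        first_in_b = {}
        for j, line in enumerate(buf_b):
            first_in_b.setdefault(line, j)
        best = None
        for i, line in enumerate(buf_a):
            j = first_in_b.get(line)
            if j is not None and (best is None or i + j < sum(best)):
                best = (i, j)
        if best is None:
            # No common line inside the window: report everything buffered.
            best = (len(buf_a), len(buf_b))
        for _ in range(best[0]):
            yield "-", buf_a.popleft()
        for _ in range(best[1]):
            yield "+", buf_b.popleft()


def diff_files(path_a, path_b, window=1000):
    """Write the differences between two files to STDOUT, one line each."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for tag, line in stream_diff(fa, fb, window):
            sys.stdout.buffer.write(tag.encode() + b" " + line)
```

Calling `diff_files("a.txt", "b.txt")` (placeholder file names) prints the differing lines prefixed with `-` or `+`. The output is not guaranteed to be a minimal diff; a larger `window` trades memory for better resynchronization.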

Aside: if you land here looking for a way to confirm that two directory structures (containing very large files) are identical, diff --recursive --brief (or diff -r -q for short, or even diff -rq) works and does not run out of memory.
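For cases where the same check has to be scripted, a rough Python equivalent might look like the sketch below. It is an illustration under assumptions, not what diff does internally: `relative_files` and `tree_differences` are hypothetical helper names, only regular files are considered (empty directories and special files are not reported specially), and the content comparison relies on the standard library's `filecmp.cmp` with `shallow=False`, which reads both files in small buffers.

```python
import filecmp
import os
import tempfile


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def tree_differences(root_a, root_b):
    """Return a sorted list of (relative_path, reason) for every mismatch."""
    files_a, files_b = relative_files(root_a), relative_files(root_b)
    problems = [(rel, "only in first tree") for rel in files_a - files_b]
    problems += [(rel, "only in second tree") for rel in files_b - files_a]
    for rel in files_a & files_b:
        # shallow=False forces a comparison of the file contents
        # instead of trusting size and modification time alone.
        same = filecmp.cmp(
            os.path.join(root_a, rel), os.path.join(root_b, rel), shallow=False
        )
        if not same:
            problems.append((rel, "contents differ"))
    return sorted(problems)


if __name__ == "__main__":
    # Tiny demonstration on two throwaway directories.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        with open(os.path.join(a, "data.txt"), "wb") as f:
            f.write(b"version 1\n")
        with open(os.path.join(b, "data.txt"), "wb") as f:
            f.write(b"version 2\n")
        for rel, reason in tree_differences(a, b):
            print(rel, reason)
```

An empty result list means the two trees hold the same files with the same contents, which corresponds to diff -rq printing nothing.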
