Using comm with NULL-terminated records


Over in an answer to a different question, I wanted to use a structure much like this to find files that appear in list2 that do not appear in list1:

( cd dir1 && find . -type f -print0 ) | sort -z > list1
( cd dir2 && find . -type f -print0 ) | sort -z > list2
comm -13 list1 list2

However, I hit a brick wall because my version of comm cannot handle NULL-terminated records. (Some background: I'm passing a computed list to rm, so I particularly want to be able to handle file names that could contain an embedded newline.)

If you want an easy worked example, try this:

mkdir dir1 dir2
touch dir1/{a,b,c} dir2/{a,c,d}
( cd dir1 && find . -type f ) | sort > list1
( cd dir2 && find . -type f ) | sort > list2
comm -13 list1 list2

Without NULL-terminated lines the output here is the single element ./d that appears only in list2.

I'd like to be able to use find ... -print0 | sort -z to generate the lists.

How can I best reimplement an equivalent to comm that outputs the NULL-terminated records that appear in list2 but that do not appear in list1?

Best Answer

GNU comm (as of GNU coreutils 8.25) now has a -z/--zero-terminated option for that.
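As a minimal sketch of how that option could be applied to the question's example, assuming GNU find, sort and comm (coreutils 8.25 or later): the scratch directory, the `only_in_2` output file and the file name with an embedded newline are made up for illustration.

```shell
# Sketch only: needs GNU find, sort and comm (coreutils >= 8.25 for comm -z).
tmp=$(mktemp -d)                       # hypothetical scratch directory
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/a" "$tmp/dir1/b" "$tmp/dir1/c"
touch "$tmp/dir2/a" "$tmp/dir2/c" "$tmp/dir2/d"
nl='
'
touch "$tmp/dir2/new${nl}line"         # a name with an embedded newline

# Build both lists with NUL-terminated records, sorted byte-wise
( cd "$tmp/dir1" && find . -type f -print0 ) | LC_ALL=C sort -z > "$tmp/list1"
( cd "$tmp/dir2" && find . -type f -print0 ) | LC_ALL=C sort -z > "$tmp/list2"

# -z makes comm read and write NUL-terminated records
LC_ALL=C comm -z -13 "$tmp/list1" "$tmp/list2" > "$tmp/only_in_2"

# Count the records (one NUL byte per record)
tr -cd '\0' < "$tmp/only_in_2" | wc -c
```

Here the result should hold two records, `./d` and the name containing the newline, each still terminated by NUL and so safe to hand to something like `xargs -0`.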

For older versions of GNU comm, you should be able to swap NUL and NL:

comm -13 <(cd dir1 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) \
         <(cd dir2 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) |
  tr '\n\0' '\0\n'

That way comm still works with newline-delimited records, but with actual newlines in the input encoded as NULs, so we're still safe with filenames containing newlines.
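To see why nothing is lost, the swap can be tried on its own. This is only a sketch, assuming a tr that accepts `\0` in its arguments (GNU tr does); the `demo` directory and the sample records are hypothetical.

```shell
# Sketch only: assumes a tr that accepts \0 (GNU tr does).
demo=$(mktemp -d)                      # hypothetical scratch directory
nl='
'
# Two NUL-terminated records; the first one contains a newline
printf 'one%sline\0two\0' "$nl" > "$demo/records"

tr '\n\0' '\0\n' < "$demo/records" > "$demo/swapped"    # now newline-terminated
tr '\n\0' '\0\n' < "$demo/swapped" > "$demo/restored"   # swap back

wc -l < "$demo/swapped"                # number of newline-terminated records
cmp "$demo/records" "$demo/restored" && echo "round trip is lossless"
```

Because the same tr invocation maps each of the two bytes to the other, applying it twice restores the original data, which is why the pipeline ends with the same tr command it starts with.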

You may also want to set the locale to C because on GNU systems and most UTF-8 locales at least, there are different strings that sort the same and would cause problems here¹.
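One way to make sure sort and comm agree on the collation order is to set the locale once for the whole pipeline. The following is a sketch under that assumption, not a definitive implementation: `only_in_second` is a hypothetical helper name, and it uses temporary files from mktemp rather than process substitution.

```shell
# only_in_second DIR1 DIR2  (hypothetical helper)
# Prints, NUL-terminated, the files found under DIR2 but not under DIR1.
# Assumes GNU find, tr, sort and comm, which tolerate NUL bytes in their input.
only_in_second() (
    export LC_ALL=C                    # byte-wise order for both sort and comm
    a=$(mktemp) b=$(mktemp)
    (cd "$1" && find . -type f -print0) | tr '\n\0' '\0\n' | sort > "$a"
    (cd "$2" && find . -type f -print0) | tr '\n\0' '\0\n' | sort > "$b"
    comm -13 "$a" "$b" | tr '\n\0' '\0\n'   # swap back to NUL-terminated
    rm -f "$a" "$b"
)
```

The function body runs in a subshell, so the exported LC_ALL does not leak into the calling shell. It could be used as `only_in_second dir1 dir2 | xargs -0 ls -ld --` from within dir2 to inspect the result before replacing `ls -ld` with anything destructive.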

Swapping NUL and newline is a very common trick (see "Invert matching lines, NUL-separated" for another example with comm), but it needs utilities that support NUL in their input, which is relatively rare outside of GNU systems.


¹ Example:

$ touch dir1/{①,②} dir2/{②,③}
$ comm -12 <(cd dir1 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) \
           <(cd dir2 && find . -type f -print0 | tr '\n\0' '\0\n' | sort)  
./③
./②
$ (export LC_ALL=C
    comm -12 <(cd dir1 && find . -type f -print0 | tr '\n\0' '\0\n' | sort) \
             <(cd dir2 && find . -type f -print0 | tr '\n\0' '\0\n' | sort))
./②

(2019 edit: The relative order of ①②③ has been fixed in newer versions of the GNU libc, but the problem can still be reproduced with other characters in newer versions (2.30 at least), as around 95% of Unicode code points still have it.)
