Over in an answer to a different question, I wanted to use a structure much like this to find files that appear in list2 but do not appear in list1:
( cd dir1 && find . -type f -print0 ) | sort -z > list1
( cd dir2 && find . -type f -print0 ) | sort -z > list2
comm -13 list1 list2
However, I hit a brick wall because my version of comm cannot handle NUL-terminated records. (Some background: I'm passing a computed list to rm, so I particularly want to be able to handle file names that could contain an embedded newline.)
If you want an easy worked example, try this:
mkdir dir1 dir2
touch dir1/{a,b,c} dir2/{a,c,d}
( cd dir1 && find . -type f ) | sort > list1
( cd dir2 && find . -type f ) | sort > list2
comm -13 list1 list2
Without NUL-terminated records, the output here is the single element ./d, which appears only in list2.
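To illustrate why the newline-delimited version is unsafe for my purposes, here is a minimal sketch. The scratch directory and the file name containing a newline are made up for the demonstration, and it assumes find writes the name unmodified when its output is a pipe (GNU find does):

```shell
#!/bin/sh
# Sketch only: shows newline-delimited lists breaking on an awkward file name.
set -eu
export LC_ALL=C                 # byte-wise ordering, so sort and comm agree

work=$(mktemp -d)               # scratch area, hypothetical layout
cd "$work"
mkdir dir1 dir2
touch dir1/a dir2/a
touch "dir2/x
a"                              # one file whose name contains a newline

( cd dir1 && find . -type f ) | sort > list1
( cd dir2 && find . -type f ) | sort > list2

# The single extra file is reported as two separate records.
comm -13 list1 list2
```

Here comm prints two lines, ./x and a, neither of which is the name of the one file that exists only in dir2. Feeding that output to rm would at best fail and at worst remove an unrelated file called a.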
I'd like to be able to use find ... -print0 | sort -z to generate the lists.
How can I best reimplement an equivalent to comm that outputs the NUL-terminated records that appear in list2 but that do not appear in list1?
Best Answer
GNU comm (as of GNU coreutils 8.25) now has a -z / --zero-terminated option for that.
For older versions of GNU comm, you should be able to swap NUL and NL in its input and swap them back in its output.
That way comm still works with newline-delimited records, but with actual newlines in the input encoded as NULs, so we're still safe with filenames containing newlines.
You may also want to set the locale to C, because on GNU systems and in most UTF-8 locales at least, there are different strings that sort the same and would cause problems here¹.
That's a very common trick (see Invert matching lines, NUL-separated for another example with comm), but it needs utilities that support NUL in their input, which outside of GNU systems is relatively rare.
¹ Example:
(2019 edit: The relative order of ①②③ has been fixed in newer versions of the GNU libc, but in newer versions (2.30 at least) you can use ? ? ? instead, for instance, which still have the problem, like 95% of Unicode code points.)
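The NUL/NL swap described above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: it assumes GNU find, tr, sort and comm (which tolerate NUL bytes in their input), and the scratch directory, the sample file names and the only_in_dir2 output file are made up for the demonstration.

```shell
#!/bin/sh
# Sketch of the NUL/NL swap for a comm without -z.
# Assumes GNU tools; NUL bytes in text input are not portable elsewhere.
set -eu
export LC_ALL=C                 # byte-wise collation, so sort and comm agree

work=$(mktemp -d)               # scratch area, hypothetical layout
cd "$work"
mkdir dir1 dir2
touch dir1/a dir1/b dir1/c dir2/a dir2/c dir2/d
touch "dir2/new
line"                           # a file name with an embedded newline

# Swap NUL and NL: each record now ends in a newline, and any newline
# inside a name travels as a NUL byte.
( cd dir1 && find . -type f -print0 ) | tr '\n\0' '\0\n' | sort > list1
( cd dir2 && find . -type f -print0 ) | tr '\n\0' '\0\n' | sort > list2

# Compare as usual, then swap back to obtain NUL-terminated records.
comm -13 list1 list2 | tr '\n\0' '\0\n' > only_in_dir2

# Display one record per line, showing embedded newlines as "?".
tr '\n\0' '?\n' < only_in_dir2
```

With the sample files this prints ./d and ./new?line, that is, two records, the second being the name with the embedded newline. The NUL-terminated file only_in_dir2 is the form suitable for something like xargs -0. With GNU coreutils 8.25 or later the two tr steps should be unnecessary: sort -z for the lists and comm -z -13 for the comparison.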