Shell – Efficiently merge / sort / unique large number of text files

shell, sort, uniq

I am trying the naive approach:

$ cat * | sort -u > /tmp/bla.txt

which fails with:

-bash: /bin/cat: Argument list too long

To avoid a silly solution like the following, which creates an enormous temporary file:

$ find . -type f -exec cat {} >> /tmp/unsorted.txt \;
$ cat /tmp/unsorted.txt | sort -u > /tmp/bla.txt

I thought I could process the files one by one (this should reduce memory consumption and be closer to a streaming mechanism):

$ cat proc.sh
#!/bin/sh
old=/tmp/old.txt
tmp=/tmp/tmp.txt
cat "$old" "$1" | sort -u > "$tmp"
mv "$tmp" "$old"

Followed then by:

$ touch /tmp/old.txt
$ find . -type f -exec /tmp/proc.sh {} \;

Is there a simpler, more Unix-style replacement for cat * | sort -u when the file list exceeds what ARG_MAX allows? It feels awkward to write a small shell script for such a common task.

Best Answer

With GNU sort, and a shell where printf is built-in (all POSIX-like ones nowadays except some variants of pdksh):

printf '%s\0' * | sort -u --files0-from=- > output
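
The builtin requirement matters because the "Argument list too long" limit is enforced by the kernel when a new program is executed; a builtin printf runs inside the shell itself, so the expanded glob never goes through that limit. A quick way to check both on a given system (a minimal sketch; the exact wording of the type output varies between shells):

```shell
# Limit, in bytes, on the arguments plus environment handed to an executed program.
getconf ARG_MAX

# Reports how the current shell resolves printf (builtin or external file).
type printf
```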

One problem with the printf | sort pipeline is that its two components run concurrently and independently: by the time the left side expands the * glob, the right side may already have created the output file. The output file would then be both an input and the output, which could cause problems (though maybe not with -u here). To avoid that, send the output to another directory (> ../output, for instance), or make sure the glob doesn't match the output file.
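
A minimal sketch of the first suggestion, assuming a throwaway demo directory (the file names one.txt and two.txt and the mktemp paths are made up for illustration). The output file is created outside the directory being globbed, so * can never match it:

```shell
# Hypothetical demo data in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'b\na\n' > one.txt
printf 'a\nc\n' > two.txt

# The output lives outside $dir, so the glob cannot pick it up.
out=$(mktemp)
printf '%s\0' * | sort -u --files0-from=- > "$out"
cat "$out"
```

The merged result contains each of the lines a, b and c once.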

Another way to address it in this instance is to write it:

printf '%s\0' * | sort -u --files0-from=- -o output

That way, it's sort that opens output for writing and (in my tests) it doesn't do so before it has received the full list of files, so long after the glob has been expanded. It also avoids clobbering output if none of the input files are readable.
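
To see that behavior, here is a small sketch in which a pre-existing output file is itself matched by the glob (the demo directory and file names are made up, and this relies on GNU sort reading all of its input before it rewrites the -o file):

```shell
# Hypothetical demo data in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'b\na\n' > one.txt
printf 'a\nc\n' > output      # already exists, so * matches it too

# sort reads one.txt and output in full before it rewrites output.
printf '%s\0' * | sort -u --files0-from=- -o output
cat output
```

The lines that were already in output (a and c) survive and are merged with those from one.txt.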

Another way to write it, with zsh or bash:

sort -u --files0-from=<(printf '%s\0' *) -o output

That uses process substitution, where <(...) is replaced by a file path that refers to the reading end of the pipe printf is writing to. The feature comes from ksh, but ksh insists on making the expansion of <(...) a separate argument to the command, so you can't use it with the --option=<(...) syntax. It would work with this syntax, though:

sort -u --files0-from <(printf '%s\0' *) -o output

Note that the result differs from approaches that feed the concatenated output of cat to sort when some files don't end in a newline character:

$ printf a > a
$ printf b > b
$ printf '%s\0' a b | sort -u --files0-from=-
a
b
$ printf '%s\0' a b | xargs -r0 cat | sort -u
ab

Also note that sort compares lines using the locale's collation algorithm (strcoll()), and sort -u outputs one line from each set of lines that sort the same under that algorithm, not every line that is unique at byte level. If you only care about lines being unique at byte level and don't much care about the order they're sorted in, you may want to fix the locale to C, where sorting is based on byte values (memcmp(); that would probably also speed things up significantly):

printf '%s\0' * | LC_ALL=C sort -u --files0-from=- -o output
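
If the files live in subdirectories, as the find . -type f in the question suggests, the same idea can be sketched with find producing the NUL-delimited list instead of a glob. This is a variant rather than part of the commands above: it assumes a find that supports -print0 (GNU and BSD find do), and the demo tree and file names are hypothetical. The output is written outside the searched tree so that find cannot pick it up.

```shell
# Hypothetical demo tree with files at two levels.
top=$(mktemp -d)
mkdir -p "$top/tree/sub"
printf 'b\nA\n' > "$top/tree/one.txt"
printf 'a\nB\nb\n' > "$top/tree/sub/two.txt"

cd "$top/tree" || exit 1
# find emits NUL-terminated paths; sort reads that list from stdin.
find . -type f -print0 | LC_ALL=C sort -u --files0-from=- -o ../output
cat ../output
```

With LC_ALL=C the uppercase lines come out before the lowercase ones, since the order follows byte values.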