Shell – Efficiently merge / sort / unique large number of text files

shell sort uniq

I am trying a naive:

$ cat * | sort -u > /tmp/bla.txt

which fails with:

-bash: /bin/cat: Argument list too long

To avoid a silly solution like the following, which creates an enormous temporary file:

$ find . -type f -exec cat {} >> /tmp/unsorted.txt \;
$ cat /tmp/unsorted.txt | sort -u > /tmp/bla.txt

I thought I could process the files one by one, which should reduce memory consumption and be closer to a streaming mechanism:

$ cat proc.sh
#!/bin/sh
old=/tmp/old.txt
tmp=/tmp/tmp.txt
cat $old "$1" | sort -u > $tmp
mv $tmp $old

Followed then by:

$ touch /tmp/old.txt
$ find . -type f -exec /tmp/proc.sh {} \;

Is there a simpler, more Unix-style replacement for cat * | sort -u when there are too many files to fit within ARG_MAX? It feels awkward to write a small shell script for such a common task.

Best Answer

With GNU sort, and a shell where printf is built-in (all POSIX-like ones nowadays except some variants of pdksh):

printf '%s\0' * | sort -u --files0-from=- > output

One problem with this: the two sides of the pipeline run concurrently and independently, so by the time the left side expands the * glob, the right side may already have created the output file. The output file would then be both an input and the output, which could cause problems (maybe not with -u here). To avoid that, send the output to another directory (> ../output, for instance), or make sure the glob doesn't match the output file.

Another way to address it in this instance is to write it:

printf '%s\0' * | sort -u --files0-from=- -o output

That way, sort itself opens output for writing and (in my tests) does not do so until it has received the full list of files, long after the glob has been expanded. It also avoids clobbering output if none of the input files are readable.
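The two precautions above (keeping the output out of the glob, and letting sort open the output itself) can be combined. The following is only a sketch under stated assumptions: it needs GNU sort for --files0-from, and the scratch directory, file names and contents are made up for illustration.

```shell
# Build a scratch directory with two overlapping input files
# (hypothetical names and contents).
workdir=$(mktemp -d)
mkdir "$workdir/in"
printf 'pear\napple\n' > "$workdir/in/one.txt"
printf 'apple\nfig\n'  > "$workdir/in/two.txt"

# Expand the glob inside the input directory, but have sort write the
# result one level up, so the output can never be matched by the glob.
(
  cd "$workdir/in" || exit 1
  printf '%s\0' * | sort -u --files0-from=- -o ../merged.txt
)

cat "$workdir/merged.txt"
```

With these sample files, merged.txt should contain the three distinct lines apple, fig and pear.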

Another way to write it, with zsh or bash:

sort -u --files0-from=<(printf '%s\0' *) -o output

That uses process substitution, where <(...) is replaced by a file path that refers to the reading end of the pipe printf is writing to. The feature comes from ksh, but ksh insists on making the expansion of <(...) a separate argument to the command, so you can't use it with the --option=<(...) syntax. It would work with this syntax though:

sort -u --files0-from <(printf '%s\0' *) -o output

Note that the result differs from approaches that pipe the output of cat into sort when some files don't end in a newline character:

$ printf a > a
$ printf b > b
$ printf '%s\0' a b | sort -u --files0-from=-
a
b
$ printf '%s\0' a b | xargs -r0 cat | sort -u
ab

Also note that sort uses the collation algorithm of the locale (strcoll()), and sort -u reports one of each set of lines that sort the same by that algorithm, not unique lines at byte level. If you only care about lines being unique at byte level and don't care so much about the order they're sorted in, you may want to fix the locale to C, where sorting is based on byte values (memcmp(); that would probably speed things up significantly):

printf '%s\0' * | LC_ALL=C sort -u --files0-from=- -o output
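To see what byte-value ordering means in practice, here is a small sketch on made-up input (the temporary file and its contents are assumptions for illustration). In the C locale, ASCII uppercase letters sort before lowercase ones; other locales may order the same lines differently.

```shell
# Made-up sample: the same letter in both cases, plus a duplicate line.
sample=$(mktemp)
printf '%s\n' b a B a > "$sample"

# In the C locale, lines are compared byte by byte, so ASCII uppercase
# letters (0x41-0x5A) come before lowercase ones (0x61-0x7A).
c_sorted=$(LC_ALL=C sort -u "$sample")
printf '%s\n' "$c_sorted"

rm -f "$sample"
```

Here the duplicate a is dropped and the remaining lines come out as B, a, b.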

Related Solutions

How to sort access log efficiently in blocks

Try split --filter:

split --lines 1000 --filter 'sort ... | sed ... | uniq -c' access.log

This will split access.log into chunks of 1000 lines and pipe each chunk through the given filter.

If you want to save the results for each chunk separately, you can use $FILE in the filter command and possibly specify a prefix (default is x):

split --lines 1000 --filter '... | uniq -c >$FILE' access.log myanalysis-

This will generate a file myanalysis-aa containing the result of processing the first chunk, myanalysis-ab for the second chunk, etc.

The --filter option to split was introduced in GNU coreutils 8.13 (released in September 2011).
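As a concrete illustration of the $FILE mechanism, here is a minimal sketch, assuming GNU split 8.13 or later. The input data, the chunk- prefix, the .max suffix and the filter itself (keep the largest number of each chunk) are all made up; they stand in for whatever per-chunk processing you need.

```shell
# Ten numbered lines as stand-in input (hypothetical data).
workdir=$(mktemp -d)
seq 1 10 > "$workdir/numbers.txt"

# Chunks of 4 lines; each chunk is piped through the filter, and split
# sets $FILE to the name the chunk would otherwise have been written to.
# The single quotes keep $FILE from being expanded by the calling shell.
split --lines 4 --filter 'sort -rn | head -n 1 > "$FILE.max"' \
    "$workdir/numbers.txt" "$workdir/chunk-"

cat "$workdir/chunk-aa.max" "$workdir/chunk-ab.max" "$workdir/chunk-ac.max"
```

With this input the three result files hold 4, 8 and 10, one value per chunk.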

Sort and uniq columns individually in a text file

You can try something like this:

paste -d'\t' <(cut -f 1 -d' ' file | sort -u) <(cut -f 2 -d' ' file | sort -u) <(cut -f 3 -d' ' file | sort -u) <(cut -f 4 -d' ' file | sort -u) >output

I used tab as the paste delimiter to make the output easier to read.
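If your shell lacks process substitution, or the number of columns is not fixed, the same idea can be sketched with a loop and temporary files. This is only an illustration: the sample file, the two-column layout and the col1/col2 file names are assumptions, and LC_ALL=C is used just to make the ordering predictable.

```shell
# Made-up two-column, space-separated sample file.
workdir=$(mktemp -d)
printf '%s\n' 'b x' 'a y' 'b y' > "$workdir/file"

# Sort and deduplicate each column on its own, one temporary file per column.
for col in 1 2; do
  cut -d ' ' -f "$col" "$workdir/file" | LC_ALL=C sort -u > "$workdir/col$col"
done

# paste joins the per-column files line by line (tab is its default delimiter).
result=$(paste "$workdir/col1" "$workdir/col2")
printf '%s\n' "$result"
```

For this sample the output is two lines: a and x, then b and y, each pair separated by a tab.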
