Finding number of lines using find command

command-line find wc

Consider the R source code archive found at https://cloud.r-project.org/src/base/R-3/R-3.4.4.tar.gz . I extracted the archive into a folder. Now I wish to find out how many lines there are in total in that directory, so I tried the following command:

find . -type f -exec wc -l {} \+

which yields 394968 but if I try the following command:

find . -type f -exec cat {} \+ | wc -l

it yields 1848857!

Why do these two seemingly similar uses of the find command produce such drastically different results? And what is the correct way to find the number of lines, preferably using command-line utilities instead of scripting a small tool?

Best Answer

The first command you mention, find . -type f -exec wc -l {} +, really says "run wc -l on as many files as possible, until all of them have been processed". This can run wc multiple times!

On the other hand, find . -type f -exec cat {} + | wc -l can run cat several times, but will only run wc once. (In more detail: cat is called by find, which can and does decide to run it as many times as it needs, whereas the part after the pipe character, wc -l, is beyond the reach of find and is therefore run by your shell, just once.)
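One way to see how many times find actually invokes the command is to have each invocation announce itself. The following is a minimal sketch, not part of the original answer; the function name count_batches and its directory parameter are made up for illustration.

```shell
# Hypothetical helper: print one line per command invocation made by
# "find ... -exec ... {} +", together with the number of files in that batch.
count_batches() {
    # $1: directory to scan
    # Inside sh -c, the word "sh" becomes $0 and the file names become the
    # positional parameters, so $# is the number of files in this batch.
    find "$1" -type f -exec sh -c 'echo "one invocation, $# files"' sh {} +
}

count_batches .
```

On a small tree this should print a single line; on a tree as large as the R sources it may print more than one, which is the behaviour described above.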

You say that the first command "yields 394968", but it really does not; on my system its output ends with:

(Many more lines elided...)
     23 ./po/Makefile.win
     64 ./po/README
      1 ./VERSION-NICK
     97 ./README
 258450 total

Yet, by adding grep total, one can see that wc was really run twice:

$ find . -type f -exec wc -l {} + | grep total
 1590407 total
 258450 total

And, indeed, 1590407 plus 258450 is 1848857, which agrees with the second command.
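If the per-file output of wc is still wanted, the counts can be added up mechanically instead of reading off the last "total" line. This is only a sketch under stated assumptions: the helper name sum_wc_lines is made up, and it assumes no file name contains a newline. It sums the per-file lines rather than the "total" lines, because a batch that happens to contain a single file prints no total at all.

```shell
# Hypothetical helper: add up the per-file counts printed by wc -l,
# skipping the "total" summary line that a multi-file wc invocation appends.
# Paths printed by "find ." start with "./", so a two-field line whose
# second field is exactly "total" can only be a summary line.
sum_wc_lines() {
    awk '!(NF == 2 && $2 == "total") { sum += $1 } END { print sum + 0 }'
}

find . -type f -exec wc -l {} + | sum_wc_lines
```

The result should agree with the find . -type f -exec cat {} + | wc -l version, which remains the simpler choice when only the grand total matters.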


An explanation of why wc was run more than once in the find -exec wc + version of the command is vaguely hinted at by the find man page:

-exec command {} +

    This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files.  The command line is built in much the same way that xargs builds its command lines.

Note how this says "much less than ..." rather than "only once". The documentation for xargs hints that its option --max-chars is set automatically if not set by the user:

--max-chars=max-chars
-s max-chars

    Use at most max-chars characters per command line, including the command and initial-arguments and the terminating nulls at the ends of the argument strings.  The largest allowed value is system-dependent, and is calculated as the argument length limit for exec, less the size of your environment, less 2048 bytes of headroom. If this value is more than 128KiB, 128Kib is used as the default value; otherwise, the default value is the maximum.

This limits how many filenames can be passed to a single call to wc, explaining why, for large numbers of files, several calls to wc will occur, each operating on a partition of the input.
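The same partitioning can be made visible on a tiny input by forcing xargs to use very small batches. The sketch below is illustrative only: -n 2 is an artificially low per-invocation limit chosen for the demonstration, not the size-based limit that find and xargs apply by default, and the function name batch_demo is made up.

```shell
# Upper bound, in bytes, on the arguments plus environment passed to exec
# on this system; the default batch size is derived from a limit like this.
getconf ARG_MAX

# Hypothetical demo: at most 2 arguments per invocation, so echo is run
# three times for five inputs, each time on its own slice of the input.
batch_demo() {
    printf '%s\n' a b c d e | xargs -n 2 echo
}

batch_demo
# a b
# c d
# e
```

Each output line comes from a separate run of echo, just as each "total" line above comes from a separate run of wc.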
