Shell – Extra space with counted line number

shell, string, wc

I count the number of lines of my file with this command on OSX:

nl=$(wc -l < ~/myfile.txt)

Say, nl turns out to be 100. Now, I wish to use the result nl in another command, but weirdly,

echo 1-$nl

gives me 1- 100 instead of 1-100.

Demo

cv me$ nl=$(wc -l < ~/Desktop/cap.xlsx)
cv me$ echo $nl
104
cv me$ echo 1-$nl
1- 104

Why does this happen? How may I get 1-100?

Best Answer

As defined by POSIX, the output of wc shall contain an entry for each input file of the form:

"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>

However, POSIX notes that this pseudo-printf() format string differs from the one used by the System V version of wc:

"%7d%7d%7d %s\n"

POSIX does not require leading spaces to be added, so implementations are free to do what they want. There are several implementations of wc; at least the one on OSX and the one from the Heirloom Toolchest add leading spaces to the output.

$ /usr/5bin/wc -l /tmp/file
      3  /tmp/file

GNU wc also adds leading spaces when reading from standard input without any options:

$ cat file | wc
  5       5      65

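To remove all leading spaces in a POSIX shell, let the shell split the unquoted value into positional parameters (set -f disables filename expansion while doing so):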

set -f
set -- $nl
nl=$1
set +f

Note that this approach assumes the variable contains only leading or trailing spaces and no spaces in the middle, as in a b.
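
The difference between echo $nl and echo 1-$nl comes down to field splitting of the unquoted expansion: on its own, the padded value splits into the single word 104, but after 1- it splits into the two words 1- and 104, which echo joins with a space. The sketch below illustrates this with a hard-coded padded value standing in for the output of wc on OSX (an assumption), and uses arithmetic expansion as an alternative way to drop the padding. The variable names split_result and clean_result are made up for the illustration.

```shell
# Assumed value: the kind of padded count that wc -l prints on OSX.
nl='     104'

# Unquoted expansion is field-split, so echo receives "1-" and "104"
# as two separate arguments and prints them with a space in between.
split_result=$(echo 1-$nl)

# Arithmetic expansion evaluates the text as an integer and yields
# the number without any padding.
nl=$(($nl))
clean_result=$(echo "1-$nl")

printf '%s\n' "$split_result" "$clean_result"
```

Arithmetic expansion only suits values that are plain decimal integers, which is what wc -l prints when reading from standard input.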

Related Solutions

Unix – Get Characters 10 to 80 in a File

I wonder how the line feed in the file should be handled. Does that count as a character or not?

If we should simply start at byte 10 and print 71 bytes (A, C, T, G and line feed), then Sato Katsura's solution is the fastest. This assumes GNU dd or a compatible implementation for status=none; with other implementations, replace it with 2> /dev/null, though that would also hide any error messages:

 dd if=file bs=1 count=71 skip=9 status=none

If line feeds should be skipped, filter them out with tr -d '\n':

 tr -d '\n' < file | dd bs=1 count=70 skip=9 status=none

If the FASTA header should also be skipped, use:

 grep -v '^[;>]' file | tr -d '\n' | dd bs=1 count=70 skip=9 status=none

grep -v '^[;>]' file means skip all lines that start with ; or >.
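
As a rough cross-check of the byte arithmetic in the first dd command (skip 9 bytes, then copy 71, that is bytes 10 to 80), the sketch below compares it with a tail and head pipeline. The sample file, its contents, and the variable names are hypothetical, and mktemp and head -c are assumed to be available, as they are on GNU and OSX systems.

```shell
# Hypothetical sample: 100 bytes of sequence data, no line feeds.
sample=$(mktemp)
i=0
while [ "$i" -lt 10 ]; do
    printf 'ACGTACGTAC'
    i=$((i + 1))
done > "$sample"

# Bytes 10 to 80 with dd: skip the first 9 bytes, copy the next 71.
with_dd=$(dd if="$sample" bs=1 count=71 skip=9 2> /dev/null)

# The same range with tail (start at byte 10) and head (keep 71 bytes).
with_tail=$(tail -c +10 "$sample" | head -c 71)

echo "dd:   ${#with_dd} bytes"
echo "tail: ${#with_tail} bytes"
rm -f "$sample"
```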

Finding number of lines using find command

The first command you mention, find . -type f -exec wc -l {} +, really says "run wc -l on as many files as possible, until all of them have been processed". This can run wc multiple times!

On the other hand, find . -type f -exec cat {} + | wc -l can run cat several times, but will only run wc once. (In more detail: cat is called by find, which can and does decide to run it however many times it wants, whereas the part after the pipe character, wc -l, is beyond the reach of find and is therefore run by your shell, just once.)

You say that the first command "yields 394968", but it really does not; on my system its output ends with:

(Many more lines elided...)
     23 ./po/Makefile.win
     64 ./po/README
      1 ./VERSION-NICK
     97 ./README
 258450 total

Yet, by adding grep total, one can see that wc was really run twice:

$ find . -type f -exec wc -l {} + | grep total
 1590407 total
 258450 total

And, indeed, 1590407 plus 258450 is 1848857, which agrees with the second command.
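
If per-file counts are still wanted, one way to get a single overall figure regardless of how many times wc is run is to add up the per-file lines and ignore every total line. The following is only a sketch under some assumptions: the directory and file names are hypothetical, mktemp -d is assumed to be available, and file names are assumed not to contain newlines.

```shell
# Hypothetical test directory with two small files (2 and 3 lines).
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\nd\ne\n' > "$dir/two.txt"

# cat may be invoked several times, but wc -l sees one combined stream.
# The arithmetic expansion strips any padding that wc adds.
piped=$(( $(find "$dir" -type f -exec cat {} + | wc -l) ))

# Sum the per-file counts and skip the "total" line of each wc run.
summed=$(find "$dir" -type f -exec wc -l {} + |
    awk '$2 != "total" { s += $1 } END { print s + 0 }')

echo "piped=$piped summed=$summed"
rm -rf "$dir"
```

Because find prefixes every path with the starting directory, a file that happens to be called total shows up with its directory prefix and is not mistaken for a summary line.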

An explanation of why wc was run more than once in the find -exec wc + version of the command is vaguely hinted at by the find man page:

-exec command {} +

This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command lines.

Note how this says "much less than ..." rather than "only once". The documentation for xargs hints that its option --max-chars is set automatically if not set by the user:

--max-chars=max-chars
-s max-chars

Use at most max-chars characters per command line, including the command and initial-arguments and the terminating nulls at the ends of the argument strings. The largest allowed value is system-dependent, and is calculated as the argument length limit for exec, less the size of your environment, less 2048 bytes of headroom. If this value is more than 128KiB, 128Kib is used as the default value; otherwise, the default value is the maximum.

This limits how many filenames can be passed to a single call to wc, explaining why, for large numbers of files, several calls to wc will occur, each operating on a partition of the input.
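
The partitioning can be made visible on a small scale by forcing xargs to use tiny batches. This sketch uses the standard -n option (maximum number of arguments per invocation) rather than the size limit described above, purely so that the effect shows up with five arguments; the variable names are made up for the illustration.

```shell
# Five arguments, at most two per invocation: echo is run three times.
batches=$(printf '%s\n' a b c d e | xargs -n 2 echo)
printf '%s\n' "$batches"

# Each output line corresponds to one invocation of the command.
invocations=$(( $(printf '%s\n' "$batches" | wc -l) ))
echo "invocations: $invocations"
```

With wc -l in place of echo, each of those invocations would print its own total line, which is the behaviour seen in the find output above.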
