I wonder how the line feed in the file should be handled. Does that count as a character or not?
If we should simply take 71 bytes starting at byte 10 (A, C, T, G and line feed), then Sato Katsura's solution is the fastest (assuming GNU dd or a compatible implementation for status=none; with other implementations, replace it with 2> /dev/null, though that would also hide any error messages):
dd if=file bs=1 count=71 skip=9 status=none
If the line feeds should be skipped, filter them out with tr -d '\n':
tr -d '\n' < file | dd bs=1 count=70 skip=9 status=none
If the FASTA header should be skipped as well, use:
grep -v '^[;>]' file | tr -d '\n' | dd bs=1 count=70 skip=9 status=none
Here grep -v '^[;>]' file means: skip all lines that start with ; or >.
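As a small sanity check of the last pipeline, here is a sketch run against a made-up file. The file name sample.fa, its contents, and the shorter count=5 are assumptions chosen so the result is easy to verify by hand; 2>/dev/null is used instead of status=none so that it does not depend on GNU dd.

```shell
# Hypothetical sample file: one header line and two sequence lines.
printf '>seq1 sample header\nACGTACGTAC\nGGGGTTTTCC\n' > sample.fa

# Drop header lines, join the sequence lines into one stream,
# then skip 9 bytes and copy the next 5 (bytes 10 to 14).
result=$(grep -v '^[;>]' sample.fa | tr -d '\n' | dd bs=1 skip=9 count=5 2>/dev/null)

echo "$result"   # prints CGGGG for this sample
rm -f sample.fa
```

The joined sequence is ACGTACGTACGGGGTTTTCC, so byte 10 is the final C of the first line and the next four bytes come from the second line. Without the tr step, the line feed after byte 10 would be counted as one of the five bytes.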
The first command you mention, find . -type f -exec wc -l {} +, really says "run wc -l on as many files as possible, until all of them have been processed". This can run wc multiple times!
On the other hand, find . -type f -exec cat {} + | wc -l can run cat several times, but will only run wc once. (In more detail, this is because in this case cat is called by find, which can and does decide to run it however many times it wants, whereas the part after the pipe character, wc -l, is beyond the reach of find, and is therefore run by your shell, just once.)
You say that the first command "yields 394968", but it really does
not; on my system its output ends with:
(Many more lines elided...)
23 ./po/Makefile.win
64 ./po/README
1 ./VERSION-NICK
97 ./README
258450 total
Yet, by adding grep total, one can see that wc was really run twice:
$ find . -type f -exec wc -l {} + | grep total
1590407 total
258450 total
And, indeed, 1590407 plus 258450 is 1848857, which agrees with the second command.
An explanation of why wc was run more than once in the find -exec wc + version of the command is vaguely hinted at by the find man page:
-exec command {} +
This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command lines.
Note how this says "much less than ..." rather than "only once". The documentation for xargs hints that its option --max-chars is set automatically if not set by the user:
--max-chars=max-chars
-s max-chars
Use at most max-chars characters per command line, including the command and initial-arguments and the terminating nulls at the ends of the argument strings. The largest allowed value is system-dependent, and is calculated as the argument length limit for exec, less the size of your environment, less 2048 bytes of headroom. If this value is more than 128KiB, 128Kib is used as the default value; otherwise, the default value is the maximum.
This limits how many filenames can be passed to a single call to wc, explaining why, for large numbers of files, several calls to wc will occur, each operating on a partition of the input.
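The batching effect is hard to reproduce on demand with find -exec ... +, because it only kicks in once the argument list gets long. As a rough sketch, the same behaviour can be imitated by piping into xargs -n 2, which artificially limits each wc call to two files. The scratch directory, the five one-line files, and the limit of two are all assumptions made for illustration.

```shell
# Hypothetical scratch directory holding five one-line files.
dir=$(mktemp -d)
for i in 1 2 3 4 5; do echo "line" > "$dir/f$i"; done

# Force batches of at most two files per wc call. Every call that receives
# more than one file prints its own "total" line, so count those lines.
batched_totals=$(find "$dir" -type f | xargs -n 2 wc -l | grep -c ' total$')

# cat may be run in several batches too, but the single wc after the pipe
# sees all of the concatenated input.
single_total=$(find "$dir" -type f -exec cat {} + | wc -l)

echo "total lines printed by batched wc: $batched_totals"
echo "line count from cat | wc -l: $((single_total))"
rm -rf "$dir"
```

With five files and batches of two, wc runs three times (2 + 2 + 1 files) and prints a "total" line only for the two batches that contain more than one file, while the cat | wc -l form reports all five lines in one number.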
Best Answer
As defined by POSIX, the output of wc shall contain an entry for each input file of the form:
"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>
But this output format pseudo-printf() string differs from the System V version of wc:
"%7d%7d%7d %s\n"
POSIX doesn't require leading spaces to be added, so an implementation is free to do what it wants. There are different implementations of wc; at least the ones on OS X and in the Heirloom Toolchest add leading spaces to the output. GNU wc also adds leading spaces when reading from standard input without any options.
To remove all leading spaces, in POSIX shell:
Note that this approach assumes that the variable only contains leading or trailing spaces, with no spaces in the middle, like a b.
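As a minimal sketch of one way to strip the leading spaces using only POSIX parameter expansion (not necessarily the approach this answer had in mind), the inner expansion isolates the leading run of whitespace and the outer one removes it as a prefix. The variable names and the sample file are made up for illustration.

```shell
# Hypothetical three-line sample file.
printf 'a\nb\nc\n' > sample.txt

n=$(wc -l < sample.txt)        # may carry leading spaces, depending on the wc
n=${n#"${n%%[![:space:]]*}"}   # remove whatever leading whitespace is there
echo "[$n]"

padded='      4'               # simulated System V style padding
stripped=${padded#"${padded%%[![:space:]]*}"}
echo "[$stripped]"

rm -f sample.txt
```

This particular variant removes only the leading run of whitespace and leaves everything from the first non-space character onwards untouched.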