Shell – How to use GNU parallel to calculate sha256 hash

gnu-parallel hashsum shell-script

Based on this:
Simultaneously calculate multiple digests (md5, sha256)?

I have a folder that has a large number of files that I want to compute the SHA256 hash for.

I currently use this code segment:

#!/bin/bash
for file in *; do
    sha256sum "$file" > "$file".sha &
done

to compute the SHA256 hashes in parallel. However, it starts one background job per file, all at once, and my computer only has 16 physical cores.

So my question is: how can I use GNU parallel to run this so that it uses only the 16 physical cores available on my system and, as soon as one hash has been completed, automatically picks up the next file to hash?

Best Answer

Using xargs (and assuming that you have an implementation of this utility that supports -0 and -P):

printf '%s\0' * | xargs -0 -L 1 -P 16 sh -c 'sha256sum "$1" > "$1".sha' sh

This would pass all names in the current directory as a nul-terminated list to xargs. The xargs utility would call an in-line sh script for each one of these names, starting at most 16 concurrent processes. The in-line script takes the argument and runs sha256sum on it, outputting the result to a file of a similar name.
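To try the pipeline safely and confirm its results, something like the following could be used. It is only a sketch: the scratch directory and the two sample file names are invented for illustration, and -n 1 (one name per invocation) is used here in place of -L 1 to serve the same purpose.

```shell
#!/bin/sh
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Two hypothetical sample files, one with a space in its name.
printf 'foo\n' > a.txt
printf 'bar\n' > 'b c.txt'

# Hash every file, at most 16 at a time; each result goes to NAME.sha.
printf '%s\0' * | xargs -0 -n 1 -P 16 sh -c 'sha256sum "$1" > "$1".sha' sh

# Check each recorded hash against its file; sha256sum -c reports
# "NAME: OK" for every file whose hash still matches.
for sha in *.sha; do
    sha256sum -c "$sha"
done
```

Because xargs waits for all of its child processes before exiting, the verification loop only starts once every hash file has been written.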

Note that this would also possibly pick up .sha files created in a previous run of the same pipeline. To avoid this, use a slightly more sophisticated glob than * to match the particular names that you'd want to process. For example, in bash:

shopt -s extglob
printf '%s\0' !(*.sha) | xargs ...as above...

Note also that running sha256sum on large files in parallel is likely to be disk bound rather than CPU bound and that you may possibly see similar speed of operation with a smaller number of parallel tasks.

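For a GNU parallel equivalent, replace xargs with parallel, but note that GNU parallel joins its command arguments into a shell command line itself, so the quoting of the in-line script has to be adapted (or the script dropped in favour of parallel's {} replacement string).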

In the zsh shell, you can do it like this:

autoload -U zargs
setopt EXTENDED_GLOB

zargs -P 16 -L 1 -- (^(*.sha)) -- sh -c 'sha256sum "$1" > "$1".sha' sh

Related Solutions

Hashsum Calculation – How to Simultaneously Calculate Multiple Digests (md5, sha256)

Check out pee ("tee standard input to pipes") from moreutils. This is basically equivalent to Marco's tee command, but a little simpler to type.

$ echo foo | pee md5sum sha256sum
d3b07384d113edec49eaa6238ad5ff00  -
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c  -

$ pee md5sum sha256sum <foo.iso
f109ffd6612e36e0fc1597eda65e9cf0  -
469a38cb785f8d47a0f85f968feff0be1d6f9398e353496ff7aa9055725bc63e  -

Linux – Different hash value of large rsynced file on centos and ubuntu

There's a wrong assumption here:

As far as I know rsync automatically verifies the transfer went well with hash checks after the transfer is completed.

Rsync uses checksums to determine whether a sync is needed. However, rsync does not reread the copy it has created; it trusts the kernel to report errors. So the conclusion is simple: the files are not identical. The difference could be just one bit, or it could be more. A checksum does not tell you how large the mismatch is.
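If both copies can be placed side by side on one machine, cmp can show how much they differ, which a checksum cannot. A small sketch, assuming files of equal length; the two file names and their contents are invented for illustration and differ in a single byte:

```shell
#!/bin/sh
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Two hypothetical copies of the same data; the second has one byte changed.
printf 'hello world\n' > original.bin
printf 'hello_world\n' > copy.bin

# The checksums differ, but say nothing about the size of the difference.
sha256sum original.bin copy.bin

# cmp -l prints one line per differing byte (its offset and the two byte
# values), so counting those lines gives the number of mismatching bytes.
differing=$(cmp -l original.bin copy.bin | wc -l)
echo "differing bytes: $differing"
```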
