DD – Create Random Data and Handle Partial Read Warning

I am creating a 1 TB file of random data with dd if=/dev/urandom of=file bs=1M count=1000000. When I check the progress with kill -SIGUSR1 <PID>, I get the following:

691581+0 Datensätze ein
691580+0 Datensätze aus
725174190080 Bytes (725 GB) kopiert, 86256,9 s, 8,4 MB/s
800950+1 Datensätze ein
800950+0 Datensätze aus
839856947200 Bytes (840 GB) kopiert, 99429,5 s, 8,4 MB/s
dd: warning: partial read (809620 bytes); suggest iflag=fullblock
803432+1 Datensätze ein
803431+1 Datensätze aus
842459273876 Bytes (842 GB) kopiert, 99791,3 s, 8,4 MB/s

I can't interpret the warning. What does it mean? Is my file still random after the warning, or is there a problem?
What do +0 and +1 mean in 800950+1 Datensätze ein (records in) and 800950+0 Datensätze aus (records out)? After the warning it is +1. Is it an error count?

Best Answer

Summary: dd is a cranky tool which is hard to use correctly. Don't use it, despite the numerous tutorials that tell you to. dd has a “unix street cred” vibe attached to it, but if you truly understand what you're doing, you'll know that you shouldn't touch it with a 10-foot pole.

dd makes a single call to the read system call per block (defined by the value of bs). There is no guarantee that the read system call returns as much data as the specified buffer size. This tends to work for regular files and block devices, but not for pipes and some character devices. See When is dd suitable for copying data? (or, when are read() and write() partial) for more information. If the read system call returns less than one full block, then dd transfers a partial block. It still copies the specified number of blocks, so the total number of transferred bytes is less than requested.

The warning about a “partial read” tells you exactly this: one of the reads was partial, so dd transferred an incomplete block. In the block counts, +1 means that one block was read partially; since the output count is +0, all blocks were written out as read.

This doesn't affect the randomness of the data: all the bytes that dd writes out are bytes that it read from /dev/urandom. But you got fewer bytes than expected.
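As a rough check, the byte total in the last progress report can be reproduced from the block counts. This is only a sketch: it assumes that the single partial block in 803431+1 is the 809620-byte read named in the warning, and the variable names are made up for illustration.

```shell
# Figures taken from the last progress report in the question.
bs=$((1024 * 1024))        # bs=1M
full_out=803431            # "803431+1" out: 803431 full blocks, 1 partial
partial_bytes=809620       # size reported in the partial-read warning

# Full blocks plus the one short block.
total=$((full_out * bs + partial_bytes))
echo "bytes written so far: $total"

# A partial block still counts against count=, but carries fewer than bs bytes.
shortfall=$((bs - partial_bytes))
echo "bytes missing because of the partial block: $shortfall"
```

The total comes out as 842459273876, which matches the figure dd printed, and the shortfall of 238956 bytes is how much smaller than requested the final file will be if no further read is interrupted.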

Linux's /dev/urandom accommodates arbitrarily large requests (source: extract_entropy_user in drivers/char/random.c), so dd is normally safe when reading from it. However, reading large amounts of data takes time. If the process receives a signal, the read system call returns before filling its output buffer. This is normal behavior, and applications are supposed to call read in a loop; dd doesn't do this, for historical reasons (dd's origins are murky, but it seems to have started out as a tool to access tapes, which have peculiar requirements, and was never adapted to be a general-purpose tool). When you check the progress, this sends the dd process a signal which interrupts the read. You have a choice between knowing how many bytes dd will copy in total (make sure not to interrupt it: no progress check, no suspension), or knowing how many bytes dd has copied so far, in which case you can't know how many more bytes it will copy.

The version of dd in GNU coreutils (as found on non-embedded Linux and on Cygwin) has a flag fullblock which tells dd to call read in a loop and thus always transfer full input blocks. The warning suggests that you use it; you should always use it, except in very special circumstances (mostly when accessing tapes). That is, if you use dd at all: there are usually better solutions (see below). Note that GNU dd accepts fullblock only as an input flag; oflag=fullblock is rejected as an invalid output flag.

dd if=/dev/urandom iflag=fullblock of=file bs=1M count=1000000
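The effect of fullblock can be seen on a small scale with a pipe that delivers its data in two bursts. This is a sketch that assumes GNU dd (for iflag=fullblock); slow_source is a made-up helper, not a standard command.

```shell
# Hypothetical helper: emits 3 bytes, pauses, then emits 3 more, so the
# first read() on the pipe can only return "abc".
slow_source() { printf 'abc'; sleep 1; printf 'def'; }

# Without fullblock: one read(), one partial block, and count=1 is used up.
short=$(slow_source | dd bs=6 count=1 2>/dev/null)

# With iflag=fullblock: dd keeps reading until the 6-byte block is full.
full=$(slow_source | dd bs=6 count=1 iflag=fullblock 2>/dev/null)

echo "without fullblock: $short"
echo "with fullblock:    $full"
```

The first command copies only abc even though one 6-byte block was requested; the second copies abcdef.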

Another possible way to be sure of what dd will do is to pass a block size of 1. Then you can tell how many bytes were copied from the block count, though I'm not sure what will happen if a read is interrupted before reading the first byte (which is not very likely in practice but can happen). However, even if it works, this is very slow.

The general advice on using dd is do not use dd. Although dd is often advertised as a low-level command to access devices, it is in fact no such thing: all the magic happens in the device file (the /dev/…) part, dd is just an ordinary tool with a high potential for misuse resulting in data loss. In most cases, there is a simpler and safer way to do what you want, at least on Linux.

For example, to read a certain number of bytes at the beginning of a file, just call head:

head -c 1000000m </dev/urandom >file

I made a quick benchmark on my machine and did not observe any performance difference between dd with a large block size and head.
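A scaled-down version is easy to verify. In this sketch, make_random_file is a made-up helper and the 1000000-byte size is arbitrary; head -c is assumed to be available (it is in GNU coreutils and on the BSDs, though not required by POSIX).

```shell
# Hypothetical helper: write exactly $2 random bytes to the file $1.
make_random_file() {
    head -c "$2" </dev/urandom >"$1"
}

sample=$(mktemp)
make_random_file "$sample" 1000000

# wc -c reports the size; tr strips the padding some implementations add.
size=$(wc -c <"$sample" | tr -d ' ')
echo "file size: $size bytes"
rm -f "$sample"
```

Unlike dd with count=, head keeps reading until it has the requested number of bytes, so an interrupted read does not shorten the result.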

If you need to skip some bytes at the beginning, pipe tail into head:

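# dd version (skip= skips input blocks; seek= would skip output blocks instead):
dd if=input of=output count=C bs=B skip=S
# equivalent with tail and head:
<input tail -c +$((S*B+1)) | head -c $((C*B)) >output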
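The equivalence can be checked on a tiny file. This sketch uses arbitrary values for B, S and C, and the dd side uses skip=, the operand that skips input blocks (seek= skips output blocks instead).

```shell
sample=$(mktemp)
printf '0123456789' >"$sample"

B=3; S=1; C=2     # block size, blocks to skip, blocks to copy

# Skip S blocks of input, then copy C blocks.
with_dd=$(dd if="$sample" bs="$B" skip="$S" count="$C" 2>/dev/null)

# tail -c +N starts at byte N (1-based), hence the +1.
with_tail_head=$(tail -c +$((S * B + 1)) <"$sample" | head -c $((C * B)))

echo "dd:          $with_dd"
echo "tail | head: $with_tail_head"
rm -f "$sample"
```

Both commands print 345678: three bytes skipped, six bytes copied.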

If you want to see progress, call lsof to see the file offset (on Linux, /proc/PID/fdinfo/FD shows it as well). This only works on a regular file (the output file in your example), not on a character device.

lsof -a -p 1234 -d 1
cat /proc/1234/fdinfo/1
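On Linux, the offset can be pulled out of the fdinfo file with a one-line awk script. This is a sketch: fd_offset is a made-up helper, and it relies on the pos: line that Linux prints in /proc/PID/fdinfo/FD. The demonstration inspects the current shell rather than a running copy.

```shell
# Hypothetical helper: print the current offset of file descriptor $2 in
# process $1, taken from the "pos:" line of /proc/PID/fdinfo/FD (Linux only).
fd_offset() {
    awk '/^pos:/ { print $2 }' "/proc/$1/fdinfo/$2"
}

# Demonstration: open a scratch file on fd 3 of this shell, write 5 bytes,
# then read the offset back.
scratch=$(mktemp)
exec 3>"$scratch"
printf 'hello' >&3
offset=$(fd_offset "$$" 3)
echo "fd 3 is at offset $offset"
exec 3>&-
rm -f "$scratch"
```

For a real copy you would pass the PID of the copying process and the descriptor of its output file, for example fd_offset 1234 1.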

You can call pv to get a progress report (better than dd's), at the expense of an additional item in the pipeline (performance-wise, it's barely perceptible).
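A pipeline with pv might look like the following sketch. with_progress is a made-up wrapper that falls back to cat when pv is not installed, and the 1000000-byte size is arbitrary; pv's -s option tells it the expected total so that it can show a percentage and an ETA.

```shell
# Hypothetical wrapper: pass stdin to stdout, reporting progress on stderr
# through pv when it is installed, and falling back to plain cat otherwise.
# $1 is the expected total size in bytes.
with_progress() {
    if command -v pv >/dev/null 2>&1; then
        pv -s "$1"
    else
        cat
    fi
}

out=$(mktemp)
head -c 1000000 </dev/urandom | with_progress 1000000 >"$out"
size=$(wc -c <"$out" | tr -d ' ')
echo "wrote $size bytes"
rm -f "$out"
```

Since pv only sits in the middle of the pipe, it cannot change how many bytes head delivers.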

Related Solutions

Is /dev/random data a pseudo-random AES cypher, and where does the entropy come from

Linux has two random number generators available to userspace, /dev/random and /dev/urandom.

/dev/random is a source of "true" randomness - i.e. it is not generated by a pseudo-random number generator. Entropy is fed into this by the input driver and the interrupt handler, through the functions add_input_randomness and add_interrupt_randomness. Processes reading this device will block if the entropy runs out.

/dev/urandom is a pseudo-random number generator. It is fed by the same entropy pool as /dev/random, but when that runs out, it switches to a cryptographically strong generator.

Userspace applications can feed into the entropy pool by writing to /dev/{,u}random.

Have a read of the random(4) manual page, and the file drivers/char/random.c in the kernel source tree. It is well commented and most of what you ask is explained there.

FreeBSD's /dev/random by default is a pseudo-random number generator using the Yarrow algorithm (but can point to a hardware RNG if one is connected). The software generator takes entropy from Ethernet and serial connections and hardware interrupts (changeable through sysctl kern.random). The Yarrow algorithm is believed to be secure as long as the internal state is unknown, therefore /dev/random should always output high-quality data without blocking. See random(4).

On NetBSD, /dev/random provides random data based only on entropy collected (from disks, network, input devices, and/or tape drives; adjustable using rndctl), while /dev/urandom falls back to a PRNG when the entropy pool is empty, similar to Linux. See random(4), rndctl(8), rnd(9).

OpenBSD has four generators: /dev/random is a hardware generator, /dev/srandom is a secure random data generator (using MD5 on the entropy pool: "disk and network device interrupts and such"), /dev/urandom is similar but falls back to a PRNG when the entropy pool is empty. The fourth, /dev/arandom, is also a PRNG but using RC4. See random(4), arc4random(3).

Mac OS X also uses the Yarrow algorithm for /dev/random, but has an identically working /dev/urandom for compatibility. "Additional entropy is fed to the generator regularly by the SecurityServer daemon from random jitter measurements of the kernel." See random(4).

When is dd suitable for copying data? (or, when are read() and write() partial)

From the spec:

If the bs=expr operand is specified and no conversions other than sync, noerror, or notrunc are requested, the data returned from each input block shall be written as a separate output block; if the read() returns less than a full block and the sync conversion is not specified, the resulting output block shall be the same size as the input block.

So this is probably what causes your confusion. Because dd is designed for blocking, by default partial read()s are mapped 1:1 to partial write()s, or else padded at the tail with NUL or space characters up to the bs= size when conv=sync is specified.

This means that dd is safe to use for copying data (with no risk of corruption due to a partial read or write) in every case but one: when it is arbitrarily limited by a count= argument. Otherwise dd will happily write() its output in blocks of the same size as those in which its input was read(), until it has read() completely through the input. And even this caveat applies only when bs= is specified or obs= is not specified, as the very next sentence in the spec states:

If the bs=expr operand is not specified, or a conversion other than sync, noerror, or notrunc is requested, the input shall be processed and collected into full-sized output blocks until the end of the input is reached.

Without ibs= and/or obs= arguments this can't matter - because ibs and obs are both the same size by default. However, you can get explicit about input buffering by specifying different sizes for either and not specifying bs= (because it takes precedence).

For example, if you do:

IN| dd ibs=1| OUT

...then a POSIX dd will write() in chunks of 512 bytes by collecting each individually read() byte into a single output block.

Otherwise, if you do...

IN| dd obs=1kx1k| OUT

...a POSIX dd will read() at maximum 512 bytes at a time, but write() every megabyte-sized output block (kernel allowing and excepting possibly the last - because that's EOF) in full by collecting input into full-sized output blocks.
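The re-blocking shows up in dd's own record counts. In this sketch the sizes are deliberately tiny (ibs=1, obs=4) and the data is discarded; only the statistics that dd prints on stderr are kept.

```shell
# LC_ALL=C keeps dd's report in English. "2>&1 >/dev/null" captures stderr
# (the statistics) and throws away the copied data.
report=$(printf 'abcdef' | LC_ALL=C dd ibs=1 obs=4 2>&1 >/dev/null)
echo "$report"
```

Six 1-byte reads are reported as 6+0 records in, and they are collected into one full 4-byte output block plus one 2-byte remainder, reported as 1+1 records out.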

Also from the spec, though:

count=n
- Copy only n input blocks.

count= maps to i?bs= blocks, and so in order to handle an arbitrary limit on count= portably you'll need two dds. The most practical way to do it with two dds is by piping the output of one into the input of another, which surely puts us in the realm of reading/writing a special file regardless of the original input type.

Because an IPC pipe now sits between the two, any [io]bs= values must stay within the system's defined PIPE_BUF limit to be safe. POSIX states that the system kernel must only guarantee atomic read()s and write()s within the limits of PIPE_BUF as defined in limits.h. POSIX guarantees that PIPE_BUF be at least ...

{_POSIX_PIPE_BUF}
- Maximum number of bytes that is guaranteed to be atomic when writing to a pipe.
- Value: 512

...(which also happens to be the default dd i/o blocksize), but the actual value is usually at least 4k. On Linux, PIPE_BUF is 4096 bytes; the 64k figure that is often quoted is the default capacity of the pipe buffer, not the atomicity limit.
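The limit on a given system can be queried with getconf. This is a small sketch; the path / is just an arbitrary existing directory.

```shell
# PIPE_BUF is a per-path limit, so getconf needs a pathname argument.
pipe_buf=$(getconf PIPE_BUF /)
echo "PIPE_BUF here: $pipe_buf bytes (POSIX minimum: 512)"
```

Any block size at or below the reported value stays within the atomicity guarantee on that system; 512 is the only value that is safe everywhere.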

So when you set up your dd processes you should do it on a block factor based on three values:

bs = ( obs = PIPE_BUF or lesser )
n = total desired number of bytes read
count = n / bs

Like:

yes | dd obs=1k | dd bs=1k count=10k of=/dev/null
10240+0 records in
10240+0 records out
10485760 bytes (10 MB) copied, 0.1143 s, 91.7 MB/s

You have to synchronize i/o w/ dd to handle non-seekable inputs. In other words, make pipe-buffers explicit and they cease to be a problem. That's what dd is for. The unknown quantity here is yes's buffer size - but if you block that out to a known quantity with another dd then a little informed multiplication can make dd safe to use for copying data (w/ no risk of corruption due to a partial read or write) even when arbitrarily limiting input w/ count= w/ any arbitrary input type on any POSIX system and without missing a single byte.
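The block-factor recipe can be wrapped up as follows. This is a sketch under stated assumptions: take_bytes is a made-up helper, the block size is fixed at 512 (the POSIX minimum for PIPE_BUF), and the requested byte count is assumed to be a multiple of 512.

```shell
# Hypothetical helper: copy exactly $1 bytes from stdin to stdout.
# Assumes $1 is a multiple of 512.
take_bytes() {
    bs=512
    # First dd: re-block arbitrary input into 512-byte writes on the pipe.
    # Second dd: read the same 512-byte blocks and stop after count of them.
    dd obs="$bs" 2>/dev/null | dd bs="$bs" count=$(( $1 / bs )) 2>/dev/null
}

copied=$(yes | take_bytes 10240 | wc -c | tr -d ' ')
echo "copied $copied bytes"
```

Because every write on the intermediate pipe is exactly one 512-byte block, the second dd never sees a short read, and count= translates directly into a byte count (10240 here).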

Here's a snippet from the POSIX spec:

ibs=expr
- Specify the input block size, in bytes, by expr (default is 512).
obs=expr
- Specify the output block size, in bytes, by expr (default is 512).
bs=expr
- Set both input and output block sizes to expr bytes, superseding ibs= and obs=. If no conversion other than sync, noerror, and notrunc is specified, each input block shall be copied to the output as a single block without aggregating short blocks.

You'll also find some of this explained better here.
