Linux – Best Methods to Copy a CD/DVD: cat, dd, pv, or Others?

Tags: cat, dd, disk-image, linux, pv

Background

I'm copying some data CDs/DVDs to ISO files so that I can use them later without needing the disc in the drive.

I've been looking on the Net for procedures and have found quite a few.

I don't know whether all of them should be equivalent, although I tested some of them (comparing the checksums of the drive and of the generated files with the md5sum tool) and, at least, dd and pv are not equivalent. Here's the md5sum of the file generated by each procedure:

md5 of dd procedure: 71b676875b0194495060b38f35237c3c

md5 of pv procedure: f3524d81fdeeef962b01e1d86e6acc04

EDIT: That output was from a different CD than the one in the outputs given below. In fact, I realized there are some interesting facts, which I provide as an answer.

Also, the resulting files differ in size from each other.
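For reference, this is roughly how I'm comparing sizes and checksums (the image file names are just illustrative):

ls -l image-dd.iso image-pv.iso                # compare file sizes
md5sum /dev/sr0 image-dd.iso image-pv.iso      # compare checksums against the drive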

So, is there a best procedure to copy a CD/DVD or am I just using the commands incorrectly?


More information about the situation

Here is more information about the test case I'm using to check the procedures I've found so far:

isoinfo -d -i /dev/sr0
Output: https://gist.github.com/JBFWP286/7f50f069dc5d1593ba62#file-isoinfo-output-19-aug-2015

dd to copy the media, with output checksums and file information
Output: https://gist.github.com/JBFWP286/75decda0a67605590d32#file-dd-output-with-md5-and-sha256-19-aug-2015

pv to copy the media, with output checksums and file information
Output: https://gist.github.com/JBFWP286/700a13fe0a2f06ce5e7a#file-pv-output-with-md5-and-sha256-19-aug-2015
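Incidentally, the isoinfo header linked above reports the logical block size and the volume size of the ISO 9660 filesystem; here is a sketch of using those two numbers to copy exactly that filesystem and nothing more (assuming the output format of genisoimage's isoinfo):

# Extract the block size and block count from the isoinfo header
blocksize=$(isoinfo -d -i /dev/sr0 | awk '/^Logical block size is:/ {print $5}')
blocks=$(isoinfo -d -i /dev/sr0 | awk '/^Volume size is:/ {print $4}')
# Read exactly blocks * blocksize bytes from the drive
dd if=/dev/sr0 bs="$blocksize" count="$blocks" of=image.iso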

Any help will be appreciated!

Best Answer

All of the following commands are equivalent. They read the bytes of the CD /dev/sr0 and write them to a file called image.iso.

cat /dev/sr0 >image.iso
cat </dev/sr0 >image.iso
tee </dev/sr0 >image.iso
dd </dev/sr0 >image.iso
dd if=/dev/cdrom of=image.iso
pv </dev/sr0 >image.iso
cp /dev/sr0 image.iso
tail -c +1 /dev/sr0 >image.iso
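If you want to check for yourself that two of these produce the same bytes, a quick comparison (file names here are just examples):

cat /dev/sr0 >image-cat.iso
dd if=/dev/sr0 of=image-dd.iso
cmp image-cat.iso image-dd.iso && echo identical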

Why would you use one over the other?

  • Simplicity. For example, if you already know cat or cp, you don't need to learn yet another command.

  • Robustness. This one is a bit of a variant of simplicity. How much risk is there that a small change to the command will change what it does? Let's see a few examples:

    • Anything with redirection: you might accidentally put a redirection the wrong way round, or forget it. Since the destination is supposed to be a non-existing file, set -o noclobber should ensure that you don't overwrite anything; however you might overwrite a device if you accidentally write >/dev/sda (for a CD, which is read-only, there's no risk, of course). This speaks in favor of cat /dev/sr0 >image.iso (hard to get wrong in a damaging way) over alternatives such as tee </dev/sr0 >image.iso (if you invert the redirections or forget the input one, tee will write to /dev/sr0). There's a short demonstration of noclobber after this list.
    • cat: you might accidentally concatenate two files. That leaves the data easily salvageable.
    • dd: i and o are close on the keyboard, and if=/of= is somewhat unusual syntax. There's no equivalent of noclobber: of= will happily overwrite anything. The shell's redirection syntax is less error-prone.
    • cp: if you accidentally swap the source and the target, the device will be overwritten (again, assuming the device isn't read-only). If cp is invoked with an option such as -R or -a, which some people add via an alias, it will copy the device node rather than the device content.
  • Additional functionality. The one tool here that has useful additional functionality is pv, with its powerful reporting options.
    But here you can check how much has been copied by looking at the size of the output file anyway.

  • Performance. This is an I/O-bound process; the main influence on performance is the buffer size: the tool reads a chunk from the source, writes the chunk to the destination, repeats. If the chunk is too small, the computer spends its time switching between tasks. If the chunk is too large, the read and write operations can't be parallelized. The optimal chunk size on a PC is typically around a few megabytes, but this is obviously very dependent on the OS, on the hardware, and on what else the computer is doing. I made benchmarks for hard disk to hard disk copies a while ago, on Linux, which showed that for copies within the same disk, dd with a large buffer size has the advantage, but for cross-disk copies, cat won over any dd buffer size.
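As promised, a minimal demonstration of the noclobber safety net (bash, ksh and zsh all support it):

set -o noclobber
cat /dev/sr0 >image.iso     # refused if image.iso already exists
cat /dev/sr0 >|image.iso    # >| explicitly overrides noclobber

Note that noclobber only protects regular files, which is why it won't save you from >/dev/sda.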

There are a few reasons why you find dd mentioned so often. Apart from performance, they aren't particularly good reasons.

  • In very old Unix systems, some text processing tools couldn't cope with binary data (they used null-terminated strings internally, so they tended to have problems with null bytes; some tools also assumed that characters used only 7 bits and didn't process 8-bit character sets properly). I'm not sure if this ever was a problem with cat (it was with more line-oriented tools such as head, sed, etc.), but people tended to avoid it on binary data because of its association with text processing. This is not a problem on modern systems such as Linux, OSX, *BSD, or anything that's POSIX-compliant.
  • There's a sort of myth that dd is somewhat “lower level” than other tools such as cat and accesses devices directly. This is completely false: dd and cat and tee and the others all read bytes from their input and write the bytes to their output. The real magic is in /dev/sr0.
  • dd has an unusual command-line syntax, so explaining how it works gives more of an opportunity to shine than just writing cat /dev/sr0 does.
  • Using dd with a large buffer size can have better performance, but it is not always the case (see some benchmarks on Linux, or the timing sketch below).
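If you want to measure this on your own hardware, a rough sketch (results vary a lot with the drive, the OS and the cache state; file names are illustrative):

# Time the same copy with cat and with dd using a large buffer
time cat /dev/sr0 >image-cat.iso
time dd if=/dev/sr0 of=image-dd.iso bs=1M
# To avoid measuring the page cache, drop caches between runs (as root):
# sync; echo 3 >/proc/sys/vm/drop_caches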

A major risk with dd is that it can silently skip some data: a partial read counts as a whole input block, so on a source that can return short reads, count= may stop before copying as much data as you asked for. I think dd is safe as long as skip or count are not passed, but I'm not sure whether this is the case on all platforms. And apart from performance, it has no advantage.
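To see the short-read problem in action, here's a demonstration using a pipe (on Linux, a single read from a pipe returns at most 64 KiB, so each of dd's 1 MiB reads comes up short):

yes | dd of=/dev/null bs=1M count=10                  # transfers far less than 10 MiB
yes | dd of=/dev/null bs=1M count=10 iflag=fullblock  # GNU dd: exactly 10 MiB

iflag=fullblock is a GNU dd extension; a CD drive normally serves full reads, so this mostly matters for pipes and network sources.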

So just use pv if you want its fancy progress report, or cat if you don't.
