Compression – On-the-Fly Stream Compression Without Hardware Overload

Tags: compression, dd, gzip, pipe

I have 200 GB free disk space, 16 GB of RAM (of which ~1 GB is occupied by the desktop and kernel) and 6 GB of swap.

I have a 240 GB external SSD, with 70 GB used (see Note 1) and the rest free, which I need to back up to my disk.

Normally, I would dd if=/dev/sdb of=Desktop/disk.img the disk first and then compress it, but making the image first is not an option: it would require far more disk space than I have, even though compression squashes the free space so the final archive easily fits on my disk.

dd writes to STDOUT by default, and gzip can read from STDIN, so in theory I can write dd if=/dev/sdb | gzip -9 -, but gzip compresses the data significantly more slowly than dd can produce it.
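Concretely, the pipeline I have in mind is something like the following (the output path is only an example):

# stream the raw device through gzip straight to a file on the internal disk
dd if=/dev/sdb | gzip -9 - > Desktop/disk.img.gz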

From man pipe:

Data written to the write end of the pipe is buffered by the kernel until it is read from the read end of the pipe.

I visualise a | as being like a real pipe — one application shoving data in and the other taking data out of the pipe's queue as quickly as possible.

What happens when the program on the left side writes data more quickly than the other side of the pipe can process it? Will it cause extreme memory or swap usage, or will the kernel try to create a FIFO on disk, thereby filling it up? Or will it just fail with SIGPIPE: Broken pipe if the buffer grows too large?

Basically, this boils down to two questions:

  1. What are the implications and outcomes of shoving more data into a pipe than is read at a time?
  2. What's the reliable way to compress a datastream to disk without putting the entire uncompressed datastream on the disk?

Note 1: I cannot just copy exactly the 70 used GB and expect to get a working system or filesystem, because fragmentation and other things mean the full contents need to be intact.

Best Answer

Technically you don't even need dd:

gzip < /dev/drive > drive.img.gz
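A matching restore later would simply decompress the stream back onto the device (device and file names here are only examples, and this overwrites the target device):

# write the saved image back; zcat decompresses to stdout
zcat drive.img.gz > /dev/drive
# or, equivalently, through dd with a sane block size
gunzip -c drive.img.gz | dd of=/dev/drive bs=1M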

If you do use dd, you should always go with a larger-than-default block size, like dd bs=1M, or suffer syscall hell (dd's default block size is 512 bytes; since it read()s and write()s in that size, that is 4096 syscalls per MiB, which is far too much overhead).
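For instance, a dd-based variant of the command above (device and output names are just placeholders):

# 1 MiB blocks cut the syscall count; status=progress is a GNU dd option
dd if=/dev/sdb bs=1M status=progress | gzip > drive.img.gz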

gzip -9 uses a LOT more CPU with very little to show for it. If gzip is slowing you down, lower the compression level, or use a different (faster) compression method.
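For example, a lower level or a faster, multithreaded compressor (zstd is just one option, not something this answer prescribes):

# lowest gzip level: much faster, still squashes zeroed free space well
dd if=/dev/sdb bs=1M | gzip -1 > drive.img.gz

# or, if zstd is installed, let it use all CPU cores
dd if=/dev/sdb bs=1M | zstd -T0 > drive.img.zst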

If you're doing file-based backups instead of dd images, you could have some logic that decides whether to compress at all (there's no point in doing so for various file types). dar (a tar alternative) is one example that has options to do so.
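For illustration only, a dar invocation might look roughly like this; the exact options vary between dar versions and the masks shown are assumptions, so check your dar man page:

# create an archive of /mnt/ssd, compressing everything except
# file types that are already compressed
dar -c /path/to/backup -R /mnt/ssd -z -Z "*.gz" -Z "*.jpg" -Z "*.mp4"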

If your free space reads as ZERO (because it's an SSD that reliably returns zeroes after TRIM and you ran fstrim and dropped caches), you can also use dd with the conv=sparse flag to create an uncompressed, loop-mountable, sparse image that uses no disk space for the zero areas. This requires the image file to be backed by a filesystem that supports sparse files.
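As a sketch (device and mount point names are assumptions):

# trim free space first so it reads back as zeroes
fstrim -v /mnt/ssd

# image the device; conv=sparse skips writing the all-zero blocks
dd if=/dev/sdb of=drive.img bs=1M conv=sparse

# compare apparent size vs. actual allocation to confirm sparseness
du -h --apparent-size drive.img
du -h drive.img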

Alternatively, for some filesystems there exist programs able to image only the used areas.
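partclone is one such tool (it is not named above, and the partition and file names here are illustrative):

# image only the blocks the ext4 filesystem actually uses
partclone.ext4 -c -s /dev/sdb1 -o sdb1.partclone.img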
