I have 200 GB free disk space, 16 GB of RAM (of which ~1 GB is occupied by the desktop and kernel) and 6 GB of swap.
I have a 240 GB external SSD, with 70 GB used¹ and the rest free, which I need to back up to my disk.
Normally, I would `dd if=/dev/sdb of=Desktop/disk.img` the disk first and then compress it, but making the image first is not an option: it would require far more disk space than I have, even though the compression step squashes the free space so the final archive easily fits on my disk.
`dd` writes to STDOUT by default, and `gzip` can read from STDIN, so in theory I can write `dd if=/dev/sdb | gzip -9 -`, but `gzip` takes significantly longer to read bytes than `dd` can produce them.
From `man pipe`:

> Data written to the write end of the pipe is buffered by the kernel until it is read from the read end of the pipe.
I visualise a `|` as being like a real pipe: one application shoving data in and the other taking data out of the pipe's queue as quickly as possible. What happens when the program on the left side writes data more quickly than the other side of the pipe can hope to process it? Will it cause extreme memory or swap usage, or will the kernel try to create a FIFO on disk, thereby filling up the disk? Or will it just fail with `SIGPIPE` ("Broken pipe") if the buffer is too large?
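The blocking behaviour is easy to observe. In the sketch below (my own illustration, not from the original post), the reader refuses to read for three seconds, so the writer stalls as soon as the kernel's pipe buffer (typically 64 KiB on Linux) is full; memory use does not grow and nothing spills to disk:

```shell
# Time a pipeline whose reader sleeps before draining.
# dd tries to push 64 MiB, but blocks once the small pipe buffer fills.
start=$(date +%s)
dd if=/dev/zero bs=1M count=64 status=none | { sleep 3; cat >/dev/null; }
end=$(date +%s)
echo "elapsed: $((end - start))s"   # at least 3s: the writer waited for the reader
```

The elapsed time is bounded below by the reader's sleep, which shows the kernel applied backpressure to `dd` rather than buffering the whole 64 MiB.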
Basically, this boils down to two questions:
- What are the implications and outcomes of shoving more data into a pipe than is read at a time?
- What's a reliable way to compress a data stream to disk without putting the entire uncompressed stream on the disk?
Note 1: I cannot just copy exactly the first 70 used GB and expect to get a working system or filesystem, because of fragmentation and other things which will require the full contents to be intact.
Best Answer
Technically you don't even need `dd`; `gzip` can read the device directly via shell redirection. Some further points:

- If you do use `dd`, you should always go with a larger-than-default blocksize like `dd bs=1M`, or suffer syscall hell (`dd`'s default blocksize is 512 bytes; since it `read()`s and `write()`s, that's 4096 syscalls per MiB, which is too much overhead).
- `gzip -9` uses a LOT more CPU with very little to show for it. If `gzip` is slowing you down, lower the compression level, or use a different (faster) compression method.
- If you're doing file-based backups instead of `dd` images, you could have some logic that decides whether to compress at all (there's no point in doing so for various file types). `dar` (a `tar` alternative) is one example that has options to do so.
- If your free space is ZERO (because it's an SSD that reliably returns zero after TRIM, and you ran `fstrim` and dropped caches), you can also use `dd` with the `conv=sparse` flag to create an uncompressed, loop-mountable, sparse image that uses zero disk space for the zero areas. This requires the image file to be backed by a filesystem that supports sparse files.
- Alternatively, for some filesystems there exist programs able to image only the used areas.
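Putting these points together, here is a minimal sketch of the stream-to-compressed-file approach. A scratch file stands in for `/dev/sdb` so the example is self-contained and harmless to run; the paths are placeholders:

```shell
# A scratch file stands in for the source device (/dev/sdb in the question).
src=$(mktemp)
dd if=/dev/zero of="$src" bs=1M count=4 status=none

# Stream-compress directly to disk: no uncompressed image is ever created.
# A moderate level such as -3 burns far less CPU than -9 for a similar ratio.
gzip -3 < "$src" > "$src.gz"

# Round-trip check: decompress and compare against the source.
gzip -dc "$src.gz" | cmp - "$src" && echo "round-trip OK"
```

The same shape works against the real device (`gzip -3 < /dev/sdb > disk.img.gz`), and the pipe's backpressure guarantees that only a small in-kernel buffer of uncompressed data exists at any moment.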