find . -name '*.gz' -type f -exec bash -o pipefail -Cc '
  for file do
    gunzip < "$file" | xz > "${file%.gz}.xz" && rm -f "$file"
  done' bash {} +
The -C prevents overwriting an existing file and won't write through a symlink, except if the existing file is a non-regular file or a symlink to a non-regular file, so you would not lose data unless you have, for instance, a file.gz and a file.xz that is a symlink to /dev/null.
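You can check that noclobber behaviour interactively; a quick sketch (bash, in an empty directory):

set -C                     # same effect as invoking bash with -C
echo test > file           # creates the file
echo test > file           # error: cannot overwrite existing file
ln -s /dev/null null-link
echo test > null-link      # succeeds: the target is a non-regular file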
To guard against that, you could use zsh instead, and also use the -execdir feature of some find implementations for good measure, to avoid some race conditions:
find . -name '*.gz' -type f -execdir zsh -o pipefail -c '
  zmodload zsh/system || exit
  for file do
    gunzip < "$file" | (
      sysopen -u 1 -w -o excl -- "${file%.gz}.xz" && xz) &&
      rm -f -- "$file"
  done' zsh {} +
Or, to clean up the xz files upon failed recompressions:
find . -name '*.gz' -type f -execdir zsh -o pipefail -c '
  zmodload zsh/system || exit
  for file do
    sysopen -u 1 -w -o excl -- "${file%.gz}.xz" &&
      if gunzip < "$file" | xz; then
        rm -f -- "$file"
      else
        rm -f -- "${file%.gz}.xz"
      fi
  done' zsh {} +
If you'd rather keep it short, and are ready to ignore some of those potential issues, in zsh you could do:
for f (./**/*.gz(D.)) {gunzip < $f | xz > $f:r.xz && rm -f $f}
Looking at the source code, the implementation of pipe_read in source/fs/pipe.c has changed quite a bit in the Linux kernel, but from a quick reading of the code in 2.0.40, 2.4.37, 2.6.32, 3.11 and 4.9, it seems to me that whenever there has been (or is, while read is blocking) a write of size w and a read of size r with r > w, then read will return at least w bytes. So if you have fixed-size chunks (of a size smaller than PIPE_BUF) and always make reads of that same size, then you are in practice guaranteed to always read a whole chunk.
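A quick sketch of that fixed-size case: two writers share one pipe, each record is exactly 16 bytes (well under PIPE_BUF), and the reader reads 16 bytes at a time, so every read returns one whole record. This assumes each printf invocation results in a single write, which holds in practice for short output:

mkfifo p
for i in 1 2 3; do printf '%-15s\n' "w1-$i"; done > p &   # three 16-byte writes
for i in 1 2 3; do printf '%-15s\n' "w2-$i"; done > p &   # three 16-byte writes
dd bs=16 < p 2> /dev/null                                 # 16-byte reads
wait; rm p

The records from the two writers may come out in either order, but none of them is ever split or interleaved.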
On the other hand, if you have variable-sized chunks, then you have no such guarantee. There is a guarantee of atomicity only on the write side: a write of less than PIPE_BUF bytes will not be interleaved with data from another writer. But on the reader side, if there has been e.g. a write of 10 bytes followed by a write of 20 bytes, and you later try to read 15 bytes, then you'll get the complete first write plus the first 5 bytes of the second write. The read call doesn't stop reading data until it would have to block or its output buffer is full.
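You can see that straddling behaviour with a sketch like this (the sleep just gives both writes time to land in the pipe buffer before the single 15-byte read):

{ printf %s 0123456789; printf %s abcdefghijklmnopqrst; } |
  { sleep 1; dd bs=15 count=1 2> /dev/null; echo; }

This prints 0123456789abcde: the whole 10-byte write plus the first 5 bytes of the 20-byte one.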
If you want to transmit data in chunks, use a datagram socket instead of a pipe.
Best Answer
If your ./run will produce its output on stdout when not given a file argument (which is customary on Unix/Linux), then you can simply pipe it straight into the compressor.
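For instance, assuming you're compressing with xz into output.txt.xz (adjust both to taste):

./run | xz > output.txt.xz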
If it needs a filename argument, but is fine writing to a pipe, then you can use a special device such as /dev/stdout or /dev/fd/1 (both should be equivalent) as that argument.
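Something like this (the compressor and target name are again assumptions):

./run /dev/stdout | xz > output.txt.xz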
Or you can use process substitution, which is typically available in most modern shells such as bash, zsh, or ksh, and which will end up using a device from /dev/fd behind the scenes to accomplish the same.
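A sketch with the >( ... ) form, under the same assumptions as before:

./run >(xz > output.txt.xz)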
This last one also needs ./run to be able to write to a pipe, but it should work better than the others if ./run writes both to output.txt and to stdout in its normal operation, in which case the outputs would get mixed up if you redirected both to stdout.

Programs are usually fine writing to a pipe, but some of them may want to seek and rewind to offsets within an output file, which is not possible in a pipe. If that's the case, then writing to a temporary file and compressing it afterwards is probably all you can do.