Shell – How to use namedpipe as temporary file

pipeshell-script

One vim plugin that I use uses this script to pass some input to linters that don't support reading from stdin.

set -eu

# All of the following arguments are read as command to run.
file_extension="$1"
shift

temp_dir=$(mktemp -d 2>/dev/null || mktemp -d -t 'ale_linter')
temp_file="$temp_dir/file$file_extension"
trap 'rm -r "$temp_dir"' EXIT

while read -r; do
  echo "$REPLY" >> "$temp_file"
done

"$@" "$temp_file"

At first I was a bit confused about why they haven't just used something like that

some input | some_program /dev/stdin

But after trying ghc as linter I've found it that it's complaining about /dev/stdin saying something about that it isn't real file (which it isn't)

So I wonder can I use namedpipe instead of temporary file.
The reason I'm not quite satisfied with writing to temporary files is SSD health and if there is a better way to do it why not do it, right?

Best Answer

No, the programs that reject those files usually reject them on the ground that the file is not seekable (they need to access the content at arbitrary offsets, or several times after rewinding etc.). Or they would want to open the file several times. They may also want to rewrite (part of) the file or truncate it.

Unnamed pipe (like with | and /dev/stdin) or named ones make no difference in any of those cases.

Actually, on Linux, /dev/stdin when stdin is pipe (named or not) behaves exactly like a named pipe, the program would not be able to differentiate that /dev/stdin from a real named pipe.

On other systems, it's not exactly the same, but in effect, opening /dev/stdin or a named pipe will get you a file descriptor to a pipe, something that is not seekable either way.

So, you'll need to create the temporary file. Note that some shells make it easier. With zsh, it's just:

#! /bin/zsh -
"$@" =(cat)

On Linux and With shells that use a deleted temporary files for here documents (like bash, zsh and some implementations of ksh), you can do:

#! /bin/bash -
"$@" /dev/fd/3 3<< EOF
$(cat)
EOF

However, that may mangle the contents of the file if it contains NUL characters or ends in empty lines.

Note that since version 5, bash makes the here doc temporary file read-only, so if the application needs to make modifications to that file, you'll to restore the write permissions with:

#! /bin/bash -
{
  chmod u+w /dev/fd/3 && # only needed in bash 5+
    "$@" /dev/fd/3
} 3<< EOF
$(cat)
EOF

A note about that while read loop since you asked.

First read -r without a variable name is not valid sh syntax. The sh syntax is specified by POSIX (ISO 9945, also IEEE Std 1003.1) like the C syntax is specified by ISO 9899.

In that specification, you'll notice that read requires a variable name argument. The behaviour when you omit it is unspecified and in practice vary with the sh interpreter implementation.

bash is the GNU sh interpreter, like gcc is the GNU C compiler. Both bash and gcc have extensions over what those standards specify.

In the case of read, bash treats read -r as if it was IFS= read -r REPLY. In the POSIX spec, IFS= read -r REPLY reads stdin until either a \n character or the end of input is reached and stores the read characters into the $REPLY variable and returns with a success exit status if a newline character was read (a full line) or failure otherwise (like EOF before the newline) and leaves the behaviour undefined if the read data contains NUL characters or sequences of bytes that don't form valid characters.

In the case of bash, it will store the bytes read even if they don't form valid characters and removes the NUL characters.

read -r is like read -r REPLY in ksh or zsh and reports an error in yash or ash-based POSIX-like shells.

The behaviour of echo is unspecified unless its arguments don't contain backslash characters and the first one is not -n.

So, to sum up, unless you know the particular sh implementation (and version) you're dealing with, you can't tell what

while read -r; do
  echo "$REPLY" >> "$temp_file"
done

will do. In the case of bash specifically, it will store stdin into the temp_file only as long as the data doesn't contain NUL characters, ends in a newline character and none of the lines matches the ^-[neE]+$ extended regular expression (and/or depending on the environment or how bash was compiled like the sh of OS/X, doesn't contain backslash characters).

It's also very inefficient and not the way you process text in shells.

Here, you want:

cat > "$temp_file"

cat is a standard command, which when not given any argument just dumps its stdin onto its stdout as-is.

Related Solutions

Shell Pipes – How to Understand Pipes

About your performance question, pipes are more efficient than files because no disk IO is needed. So cmd1 | cmd2 is more efficient than cmd1 > tmpfile; cmd2 < tmpfile (this might not be true if tmpfile is backed on a RAM disk or other memory device as named pipe; but if it is a named pipe, cmd1 should be run in the background as its output can block if the pipe becomes full). If you need the result of cmd1 and still need to send its output to cmd2, you should cmd1 | tee tmpfile | cmd2 which will allow cmd1 and cmd2 to run in parallel avoiding disk read operations from cmd2.

Named pipes are useful if many processes read/write to the same pipe. They can also be useful when a program is not designed to use stdin/stdout for its IO needing to use files. I put files in italic because named pipes are not exactly files in a storage point of view as they reside in memory and have a fixed buffer size, even if they have a filesystem entry (for reference purpose). Other things in UNIX have filesystem entries without being files: just think of /dev/null or others entries in /dev or /proc.

As pipes (named and unnamed) have a fixed buffer size, read/write operations to them can block, causing the reading/writing process to go in IOWait state. Also, when do you receive an EOF when reading from a memory buffer ? Rules on this behavior are well defined and can be found in the man.

One thing you cannot do with pipes (named and unnamed) is seek back in the data. As they are implemented using a memory buffer, this is understandable.

About "everything in Linux/Unix is a file", I do not agree. Named pipes have filesystem entries, but are not exactly file. Unnamed pipes do not have filesystem entries (except maybe in /proc). However, most IO operations on UNIX are done using read/write function that need a file descriptor, including unnamed pipe (and socket). I do not think that we can say that "everything in Linux/Unix is a file", but we can surely say that "most IO in Linux/Unix is done using a file descriptor".

Shell Command – Difference Between ‘cat file | ./binary’ and ‘./binary < file'

./binary < file

binary's stdin is the file open in read-only mode. Note that bash doesn't read the file at all, it just opens it for reading on the file descriptor 0 (stdin) of the process it executes binary in.

In:

./binary << EOF
test
EOF

Depending on the shell, binary's stdin will be either a deleted temporary file (AT&T ksh, zsh, bash...) that contains test\n as put there by the shell or the reading end of a pipe (dash, yash; and the shell writes test\n in parallel at the other end of the pipe). In your case, if you're using bash, it would be a temp file.

In:

cat file | ./binary

Depending on the shell, binary's stdin will be either the reading end of a pipe, or one end of a socket pair where the writing direction has been shut down (ksh93) and cat is writing the content of file at the other end.

When stdin is a regular file (temporary or not), it is seekable. binary may go to the beginning or end, rewind, etc. It can also mmap it, do some ioctl()s like FIEMAP/FIBMAP (if using <> instead of <, it could truncate/punch holes in it, etc).

pipes and socket pairs on the other hand are an inter-process communication means, there's not much binary can do beside reading the data (though there are also some operations like some pipe-specific ioctl()s that it could do on them and not on regular files).

Most of the times, it's the missing ability to seek that causes applications to fail/complain when working with pipes, but it could be any of the other system calls that are valid on regular files but not on different types of files (like mmap(), ftruncate(), fallocate()). On Linux, there's also a big difference in behaviour when you open /dev/stdin while the fd 0 is on a pipe or on a regular file.

There are many commands out there that can only deal with seekable files, but when that's the case, that's generally not for the files open on their stdin.

$ unzip -l file.zip
Archive:  file.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       11  2016-12-21 14:43   file
---------                     -------
       11                     1 file
$ unzip -l <(cat file.zip)
     # more or less the same as cat file.zip | unzip -l /dev/stdin
Archive:  /proc/self/fd/11
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /proc/self/fd/11 or
        /proc/self/fd/11.zip, and cannot find /proc/self/fd/11.ZIP, period.

unzip needs to read the index stored at the end of the file, and then seek within the file to read the archive members. But here, the file (regular in the first case, pipe in the second) is given as a path argument to unzip, and unzip opens it itself (typically on fd other than 0) instead of inheriting a fd already opened by the caller. It doesn't read zip files from its stdin. stdin is mostly used for user interaction.

If you run that binary of yours without redirection at the prompt of an interactive shell running in a terminal emulator, then binary's stdin will be inherited from its caller the shell, which itself will have inherited it from its caller the terminal emulator and will be a pty device open in read+write mode (something like /dev/pts/n).

Those devices are not seekable either. So, if binary works OK when taking input from the terminal, possibly the issue is not about seeking.

If that 14 is meant to be an errno (an error code set by failing system calls), then on most systems, that would be EFAULT (Bad address). The read() system call would fail with that error if asked to read into a memory address that is not writable. That would be independent of whether the fd to read the data from points to a pipe or regular file and would generally indicate a bug¹.

binary possibly determines the type of file open on its stdin (with fstat()) and runs into a bug when it's neither a regular file nor a tty device.

Hard to tell without knowing more about the application. Running it under strace (or truss/tusc equivalent on your system) could help us see what is the system call if any that is failing here.

¹ The scenario envisaged by Matthew Ife in a comment to your question sounds a lot plausible here. Quoting him:

I suspect it is seeking to the end of file to get a buffer size for reading the data, badly handling the fact that seek doesn't work and attempting to allocate a negative size (not handling a bad malloc). Passing the buffer to read which faults given the buffer is not valid.

Best Answer

Related Solutions

Shell Pipes – How to Understand Pipes

Shell Command – Difference Between ‘cat file | ./binary’ and ‘./binary < file'

Related Question