How to avoid broken pipe in commands with cat

Tags: cat, emacs, pipe

Why does this simple command fail when run in the Emacs shell (eshell)?

cat file.txt | wc

I have a file with 10241 lines. Each line has fewer than 50 characters. About 90% of the time I run this command, it gives a wrong result (most visibly the line count). No error message is shown.

Broken pipes look like a very common topic, but I haven't found any reasonable explanation, and no one proposes a workaround. How can I get this simple command working reliably?

Of course, I could just run wc file.txt. But I'm looking for a more general solution in which any tool works fine when fed by cat through a pipe: cat file.txt | any_tool_here.

Details

I'm using CentOS 5. This issue appears when using eshell (emacs shell). I'm using GNU Emacs 24.5.2.

Experiments

Samples of results using cat file.txt | wc (expected: first column to be always 10241).

8568 25706 110571
9837 29513 126947
5395 16187 69615
9202 27608 118757
7299 21899 94199
9837 29513 126947

Sample of results using wc file.txt:

10241 30723 132156
10241 30723 132156
10241 30723 132156
10241 30723 132156
10241 30723 132156
10241 30723 132156

The cat command itself (when executed alone) is working properly. I validated it with the following command (a few times): cat file.txt > file2.txt. Then, I diff'd both files and they are identical.

Best Answer

Judging from the shell that was used (eshell), the way this shell handles streams appears to be the culprit. Normally, piping means opening the two ends of a pipe and then doing fork/exec, so you get two processes that share file descriptors to the same pipe, and communication goes directly through the kernel. This way, nothing can get lost: delivery is guaranteed (although if the pipe or any stream involved is buffered, you may have to wait for the first process to exit normally before the last chunk of the stream is flushed out).
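As a rough illustration of a kernel-backed pipe with two separate processes, here is a minimal sketch that uses a named pipe (mkfifo) rather than the anonymous pipe a shell creates for `|`. The function name `count_through_fifo` and the scratch directory are made up for this example, and `seq` is assumed to be available.

```shell
# Hypothetical helper: push N lines through a named pipe and count them.
count_through_fifo() {
    dir=$(mktemp -d)            # scratch directory for the named pipe
    mkfifo "$dir/pipe"

    # Writer: a separate background process feeding the pipe.
    seq 1 "$1" > "$dir/pipe" &

    # Reader: blocks until data arrives; it sees end-of-file only
    # after the writer has closed its end of the pipe.
    wc -l < "$dir/pipe"

    wait                        # reap the writer
    rm -rf "$dir"
}

count_through_fifo 10241
```

The reader never gets a premature end-of-file here, because the kernel reports end-of-file only once every writer has closed the pipe.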

Judging from this excerpt from the Eshell manual:

Eshell is not a replacement for system shells such as bash or zsh. Use Eshell when you want to move text between Emacs and external processes; if you only want to pipe output from one external process to another (and then another, and so on), use a system shell, because Emacs’s IO system is buffer oriented, not stream oriented, and is very inefficient at such tasks. If you want to write shell scripts in Eshell, don’t; either write an elisp library or use a system shell.

eshell is not doing it the normal way. Instead, it fakes the pipe, using its "buffers" (Emacs's representation of open files) as an intermediate deposit for the data. Without further research, my guess is that at some point wc performs a read and Emacs responds with an empty chunk instead of waiting for the first program to put more input into the buffer; a read that returns 0 signals that the stream has ended. If that's the case, eshell is not only inefficient but also buggy when it comes to pipes.
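Following the manual's advice, one possible workaround is to hand the whole pipeline to a system shell, so that the pipe is created by the kernel and not emulated by Emacs. From eshell that would presumably look like sh -c 'cat file.txt | wc'. The sketch below wraps the same idea in a helper; the name `count_lines_via_sh` and the generated sample file are assumptions made for illustration.

```shell
# Hypothetical helper: let a system shell (sh) build the pipe.
# The file name is passed as a positional parameter ($1 inside sh -c),
# so names containing spaces or quotes stay intact.
count_lines_via_sh() {
    sh -c 'cat "$1" | wc -l' sh "$1"
}

# Stand-in for file.txt: 10241 short lines, as in the question.
sample=$(mktemp)
seq 1 10241 > "$sample"

# Run the pipeline several times; every run should report the same count.
for i in 1 2 3 4 5; do
    count_lines_via_sh "$sample"
done

rm -f "$sample"
```

Because sh runs as a single external process, Emacs only has to collect its final output and never sits between cat and wc.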

Related Solutions

Why does pipe not work with cat and locate

locate -e0 '*/pg_type.h' | xargs -r0 cat

locate pg_type.h would find all the files with pg_type.h in their path (so for instance if there was a rpg_type.horn directory, you'd end up displaying all the files in there).

Without -0 the output of locate can't be post-processed because the files are separated by newline characters while newline is a perfectly valid character in a file name.

cat without arguments writes to stdout what it reads from stdin, so locate | cat would be the same as locate: cat would just pass the output of locate along. What you need is to pass the list of files as arguments to cat.

That's what xargs is typically for: converting a stream of data into a list of arguments. -r tells it not to call cat if there's no input. Without -0 (which, like -r, is not standard but is found in many implementations, at least those where xargs is useful for anything), xargs would just look for words in its input to convert into arguments, where words are separated by blanks and where backslashes and single and double quotes can be used to escape those separators. That is typically not the format locate uses to display file names.

That's why we use the -0 option for both locate and xargs, which makes them use the NUL character (the only character not allowed in a file path) to separate file names.

Also note that locate is not a standard command, and there exist a great number of different implementations, with different versions thereof and different options and behaviours. The code above applies at least to relatively recent versions of the GNU locate and mlocate implementations, which are the most common on Linux-based operating systems.
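To see why NUL separation matters, here is a small sketch in which one file name contains a newline. It uses find -print0 as a stand-in for locate -0, since locate depends on a prebuilt database; the helper name `cat_all_headers` and the demo files are made up for this example, and the -print0 and -r0 options are assumed to be available (they are in the GNU tools).

```shell
# Hypothetical helper: cat every *.h file below a directory,
# passing NUL-separated names from find to xargs.
cat_all_headers() {
    find "$1" -name '*.h' -print0 | xargs -r0 cat
}

# Demo directory; the second file name contains a newline on purpose.
demo=$(mktemp -d)
printf 'first\n'  > "$demo/plain.h"
printf 'second\n' > "$demo/with
newline.h"

# Both files are read, because the newline is part of one NUL-terminated name.
cat_all_headers "$demo"

rm -rf "$demo"
```

With newline-separated names, the second file would be split into two arguments that do not exist.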

What makes a Unix process die with Broken pipe

A process receives a SIGPIPE when it attempts to write to a pipe (named or not) or socket of type SOCK_STREAM that has no reader left.

It's generally wanted behaviour. A typical example is:

find . | head -n 1

You don't want find to keep on running once head has terminated (and then closed the only file descriptor open for reading on that pipe).

The yes command typically relies on that signal to terminate.

yes | some-command

Will write "y" until some-command terminates.

Note that it's not only when commands exit; it's when all the readers have closed their reading fd to the pipe. In:

yes | ( sleep 1; exec <&-; ps -fC yes)
      1 2       1        0

There will be 1 (the subshell), then 2 (subshell + sleep), then 1 (subshell), then 0 fds reading from the pipe after the subshell explicitly closes its stdin, and that's when yes will receive a SIGPIPE.

Above, most shells use a pipe(2) while ksh93 uses a socketpair(2), but the behaviour is the same in that regard.

When a process ignores SIGPIPE, the writing system call (generally write, but it could be pwrite, send, splice...) returns with an EPIPE error. So processes wanting to handle the broken pipe manually would typically ignore SIGPIPE and take action upon an EPIPE error.
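The difference between dying from SIGPIPE and seeing EPIPE can be observed from the shell through the writer's exit status. The sketch below is an illustration under a few assumptions: the helper name `yes_status` is made up, SIGPIPE is not already ignored in the calling environment, and yes reports a write error with an ordinary non-zero exit status (as GNU yes does).

```shell
# Hypothetical helper: print the exit status of `yes` after its reader
# (head) has gone away. File descriptor 3 carries the status around the
# pipe, so this needs nothing beyond a POSIX sh.
yes_status() {
    { { yes 2>/dev/null; echo "$?" >&3; } | head -n 1 >/dev/null; } 3>&1
}

# Default disposition: yes is killed by SIGPIPE. Shells report death by
# signal as 128 + the signal number.
echo "default: $(yes_status)"

# SIGPIPE ignored: the ignored disposition is inherited by yes, whose
# write now fails with EPIPE, so yes exits on its own with an error status.
echo "ignored: $(trap '' PIPE; yes_status)"
```

The trap is set inside the command substitution, which is a subshell, so the calling shell keeps its own SIGPIPE handling.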
