There are several things to consider here.
i=`cat input`
can be expensive, and there are a lot of variations between shells.
That's a feature called command substitution. The idea is to store the whole output of the command, minus the trailing newline characters, into the i variable in memory.
To do that, shells fork the command in a subshell and read its output through a pipe or socketpair. You see a lot of variation here. On a 50MiB file here, I can see for instance bash being 6 times as slow as ksh93 but slightly faster than zsh and twice as fast as yash.
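The trailing-newline stripping is easy to observe; a minimal sketch (the file name is made up):

```shell
# Command substitution strips *all* trailing newlines, not just one
tmp=$(mktemp)
printf 'hello\n\n\n' > "$tmp"   # 8 bytes on disk
i=$(cat "$tmp")
echo "${#i}"                    # 5: the three trailing newlines are gone
rm -f "$tmp"
```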
The main reason for bash being slow is that it reads from the pipe 128 bytes at a time (while other shells read 4KiB or 8KiB at a time) and is penalised by the system call overhead.
zsh needs to do some post-processing to escape NUL bytes (other shells break on NUL bytes), and yash does even more heavy-duty processing by parsing multi-byte characters.
All shells need to strip the trailing newline characters which they may be doing more or less efficiently.
Some may want to handle NUL bytes more gracefully than others and check for their presence.
Then once you have that big variable in memory, any manipulation of it generally involves allocating more memory and copying data across.
Here, you're passing (or were intending to pass) the content of the variable to echo.
Luckily, echo is built into your shell; otherwise the execution would likely have failed with an "arg list too long" error. Even then, building the argument list array will possibly involve copying the content of the variable.
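That limit comes from the kernel; a quick way to see it (the value varies by system):

```shell
# execve() fails with E2BIG ("Argument list too long") when the argument
# list plus the environment exceeds this limit. Shell built-ins never go
# through execve(), which is why they are immune.
getconf ARG_MAX
```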
The other main problem in your command substitution approach is that you're invoking the split+glob operator (by forgetting to quote the variable).
For that, shells need to treat the string as a string of characters (though some shells don't and are buggy in that regard), so in UTF-8 locales that means parsing UTF-8 sequences (if not done already, as yash does) and looking for $IFS characters in the string. If $IFS contains space, tab or newline (which is the case by default), the algorithm is even more complex and expensive. Then, the words resulting from that splitting need to be allocated and copied.
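The effect of the split part is easy to demonstrate (a minimal sketch):

```shell
var='one two   three'
set -- $var      # unquoted: the split+glob operator runs on $var
echo "$#"        # 3: the string was split on default $IFS characters
set -- "$var"    # quoted: no splitting
echo "$#"        # 1
```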
The glob part will be even more expensive. If any of those words contain glob characters (*, ?, [), then the shell will have to read the content of some directories and do some expensive pattern matching (bash's implementation, for instance, is notoriously very bad at that).
If the input contains something like /*/*/*/../../../*/*/*/../../../*/*/*, that will be extremely expensive, as it means listing thousands of directories, and it can expand to several hundred MiB.
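A small sketch of the glob part, run in a scratch directory (the file names are made up):

```shell
dir=$(mktemp -d) && cd "$dir" || exit
touch a.txt b.txt
var='*.txt'
set -- $var     # unquoted: the pattern is matched against the directory
echo "$#"       # 2: it expanded to a.txt and b.txt
set -- "$var"   # quoted: the pattern stays a plain string
echo "$#"       # 1
cd / && rm -rf "$dir"
```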
Then echo will typically do some extra processing. Some implementations expand \x sequences in the arguments they receive, which means parsing the content and probably another allocation and copy of the data.
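Whether echo expands those sequences depends on the implementation and its options, which is why printf is the reliable choice; a minimal sketch:

```shell
var='a\nb'
printf '%s\n' "$var"   # always prints the 4 characters a \ n b literally
# echo "$var"          # may print a\nb or a<newline>b, depending on the shell
```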
In most shells, cat is not built-in, so invoking it means forking a process and executing it (so loading the code and the libraries), but after the first invocation, that code and the content of the input file will be cached in memory. On the other hand, there will be no intermediary: cat will read large amounts at a time and write them straight away without processing, and it doesn't need to allocate huge amounts of memory, just that one buffer that it reuses.
It also means that it's a lot more reliable, as it doesn't choke on NUL bytes and doesn't trim trailing newline characters (and doesn't do split+glob, though you can avoid that by quoting the variable, and doesn't expand escape sequences, though you can avoid that by using printf instead of echo).
If you want to optimise it further, instead of invoking cat several times, just pass input several times to cat:
yes input | head -n 100 | xargs cat
That will run 3 commands instead of 100.
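For instance, to output a file's contents 3 times (a minimal sketch with a made-up file name):

```shell
tmp=$(mktemp)
printf 'x\n' > "$tmp"
yes "$tmp" | head -n 3 | xargs cat   # one yes, one head, one cat (xargs may
                                     # spawn more cats for very long lists)
rm -f "$tmp"
```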
To make the variable version more reliable, you'd need to use zsh (other shells can't cope with NUL bytes) and do:
zmodload zsh/mapfile
var=$mapfile[input]
repeat 10 print -rn -- "$var"
If you know the input doesn't contain NUL bytes, then you can reliably do it POSIXly (though it may not work where printf
is not builtin) with:
i=$(cat input && echo .) || exit # add an extra .\n to avoid trimming newlines
i=${i%.} # remove that trailing dot (the \n was removed by cmdsubst)
n=10
while [ "$n" -gt 10 ]; do
printf %s "$i"
n=$((n - 1))
done
But that is never going to be more efficient than using cat in the loop (unless the input is very small).
That's due to the way /dev/stdin (actually /proc/self/fd/0) is implemented on Linux (and Cygwin, but generally not other systems).
On Linux, opening /dev/stdin is not like doing a dup(0); it just reopens anew the same file as is open on fd 0. It doesn't share the open file description that fd 0 refers to (with its read-only mode), but gets a completely unrelated new open file description, with the mode as specified in open().
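That unrelated open file description means an independent file offset, too. A Linux-specific sketch (it relies on /dev/stdin being the /proc symlink described above):

```shell
tmp=$(mktemp)
printf 'abcdef' > "$tmp"
{
  dd bs=3 count=1 2> /dev/null   # reads "abc" from fd 0; its offset is now 3
  cat /dev/stdin                 # reopens the file anew: prints abcdef again
} < "$tmp"                       # whole output: abcabcdef
rm -f "$tmp"
```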
So if sops -d /dev/stdin opens /dev/stdin in read+write mode and fd 0 was open read-only on /some/file, /some/file will be open in read+write.
Effectively, cmd /dev/stdin < file there is the same as cmd file < file. You'll find that /dev/stdin is just a symlink¹ to file:
/tmp$ namei -l /dev/stdin < file
f: /dev/stdin
drwxr-xr-x root root /
drwxr-xr-x root root dev
lrwxrwxrwx root root stdin -> /proc/self/fd/0
drwxr-xr-x root root /
dr-xr-xr-x root root proc
lrwxrwxrwx root root self -> 73569
dr-xr-xr-x stephane stephane 73569
dr-x------ stephane stephane fd
lr-x------ stephane stephane 0 -> /tmp/file
drwxr-xr-x root root /
drwxrwxrwt root root tmp
-rw-r--r-- stephane stephane file
It can get worse. If it was opened with O_TRUNC, the file would be truncated. If fd 0 was pointing to the reading end of a pipe and /dev/stdin was opened in write-only mode, you'd get the other end of the pipe.
But using:
cat file | cmd /dev/stdin
would guard against cmd overwriting file, as all cmd would see would be the pipe. And even if it did open it in write-only mode, it couldn't get back to the file; it would just get the writing end of the pipe, and the only file descriptor on the reading end would be cmd's stdin.
Other OSes don't have the problem, as opening /dev/stdin there is like doing a dup(0): you get the same open file description, and if you open with an incompatible mode, the open() system call just fails.
¹ Technically, as noted by @user414777 in comments, /proc/<pid>/fd/<fd> are magic symlinks in that, for instance, they can reach into places that normal symlinks could not, but when it comes to opening them, past the path resolution stage, they act like normal symlinks: you just open the target file.
Best Answer
They look like the same command, but the reason they differ is that the system state has changed as a result of the first command. Specifically, the first cat consumed the entire file, so the second cat has nothing left to read, hits EOF (end of file) immediately, and exits.
The reason behind this is that you are using the exact same open file description (the one you created with exec < infile and assigned to file descriptor 3) for both invocations of cat. One of the things associated with an open file description is a file offset. So, the first cat reads the entire file, leaves the offset at the end, and the second one tries to pick up from the end of the file and finds nothing to read.
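A minimal sketch of that offset sharing (the file name is made up):

```shell
tmp=$(mktemp)
printf 'abcdef' > "$tmp"
exec 3< "$tmp"   # one open file description, one shared offset
cat <&3          # prints abcdef; the offset is now at end of file
cat <&3          # prints nothing: same description, offset still at EOF
exec 3<&-
rm -f "$tmp"
```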