Command Line – Performance Difference Between stdin and Command Line Argument

Tags: cat, command-line, io-redirection, pipe

For some commands, it is possible to specify certain input as either stdin or a command line argument.

Specifically, suppose command can read its input either from stdin or from a file named on the command line, so that command < myfile, cat myfile | command, and command myfile all produce the same result.

For example,

When the command is sed:

sed s/day/night/ <myfile >new   
sed s/day/night/ myfile >new    
cat myfile | sed s/day/night/ >new

When the command is cat:

cat < myfile
cat myfile

I was wondering whether there are general rules about their
performance, i.e. which of them is usually the most efficient
and which the least.
Is redirection always better than a pipe?

Best Answer

The cat file | command syntax is considered a Useless Use of Cat. Of all your options, it is the only one that takes a performance hit, because the shell has to spawn an extra process and the data has to be copied through a pipe. However insignificant this may turn out to be in the big picture, it's overhead the other forms don't have. This has been covered in questions such as: Should I care about unnecessary cats?

Between the other two forms there is virtually no performance difference. With command < myfile, the shell opens the file and the command inherits it as its standard input (file descriptor 0); with command myfile, the command opens the file itself. Either way, the process ends up reading from an ordinary open file.
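As a minimal sketch of that equivalence, the following runs the three forms from the question against a throwaway file and compares their output. The temporary directory and the sample text are assumptions made for illustration only.

```shell
# Create a scratch file (hypothetical sample data).
tmp=$(mktemp -d)
printf 'day one\nday two\n' > "$tmp/myfile"

# The three invocation forms from the question.
from_redirect=$(sed s/day/night/ < "$tmp/myfile")
from_argument=$(sed s/day/night/ "$tmp/myfile")
from_pipe=$(cat "$tmp/myfile" | sed s/day/night/)

# All three should produce the same text; only the plumbing differs.
if [ "$from_redirect" = "$from_argument" ] && [ "$from_argument" = "$from_pipe" ]; then
    echo "all three forms agree"
fi

rm -rf "$tmp"
```

The pipe form is the only one that starts a second process (cat) to get there.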

The difference would be in what features / flexibility you are looking for.

Passing the file name to the program means the input is seekable. (Input redirected from a regular file with < is seekable too; a pipe is not.) This may or may not matter to the program, but some operations can be sped up if the stream is seekable.
Knowing the actual input file allows your program to potentially write to it, for example sed -i for in-place editing. (Note: since this has to create a new file behind the scenes, it's not a performance gain over the redirects, but it is a convenience.)
Using shell redirects gives you the ability to combine stdin with named files or even use process substitution, in shells that support it (bash, ksh, zsh): sed [exp] < <(grep command). Note that in sed [exp] < file1 file2, sed reads only file2, because stdin is ignored once a file operand is given; with GNU sed, sed [exp] - file2 < file1 reads both, since - stands for stdin. Details of this use case can be found in this question: Process substitution and pipe
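One visible consequence of the second point is that a program given a file name knows that name, while a program reading a redirected stdin does not. A small sketch with wc (the scratch file is an assumption for illustration; the exact spacing of wc output varies between systems):

```shell
# Create a scratch file (hypothetical sample data).
tmp=$(mktemp -d)
printf 'day one\nday two\n' > "$tmp/myfile"

# With a file operand, wc opens the file itself and can report its name.
with_name=$(wc -l "$tmp/myfile")

# With a redirect, the shell opens the file; wc only sees an anonymous stdin.
without_name=$(wc -l < "$tmp/myfile")

printf 'argument: %s\n' "$with_name"
printf 'redirect: %s\n' "$without_name"

rm -rf "$tmp"
```

The first line of output includes the path after the count; the second contains only the count.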

Related Solutions

Shell Script Performance – Should Unnecessary Cats Be Avoided?

The "definitive" answer is of course brought to you by The Useless Use of cat Award.

The purpose of cat is to concatenate (or "catenate") files. If it's only one file, concatenating it with nothing at all is a waste of time, and costs you a process.

Instantiating cat just so your code reads differently makes for one more process and one more set of input/output streams that are not needed. Typically the real hold-up in your scripts is going to be inefficient loops and the actual processing. On most modern systems, one extra cat is not going to kill your performance, but there is always another way to write your code.

Most programs, as you note, are able to accept an argument for the input file. However, there is always the shell's < redirection operator, which can be used wherever a stdin stream is expected and which saves you one process by doing the work in the shell process that is already running.

You can even get creative with WHERE you write it. Normally it would be placed at the end of a command before you specify any output redirects or pipes like this:

sed s/blah/blaha/ < data | pipe

But it doesn't have to be that way. It can even come first. For instance your example code could be written like this:

< data \
    sed s/bla/blaha/ |
    grep blah |
    grep -n babla

If script readability is your concern and your code is messy enough that adding a line for cat is expected to make it easier to follow, there are other ways to clean up your code. One that I use a lot, and that makes scripts easy to figure out later, is breaking up pipes into logical sets and saving them in functions. The script code then becomes very natural, and any one part of the pipeline is easier to debug.

function fix_blahs () {
    sed s/bla/blaha/ |
    grep blah |
    grep -n babla
}

fix_blahs < data

You could then continue with fix_blahs < data | fix_frogs | reorder | format_for_sql. A pipeline that reads like that is really easy to follow, and the individual components can be debugged easily in their respective functions.
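A runnable sketch of that pattern, using two made-up stages (expand_blahs and number_lines are hypothetical names, and the sample data is invented for illustration):

```shell
# Hypothetical stage 1: rewrite "bla" and keep only lines containing "blah".
expand_blahs() {
    sed s/bla/blaha/ |
    grep blah
}

# Hypothetical stage 2: prefix every non-empty line with its line number.
number_lines() {
    grep -n .
}

# Scratch input (illustrative data only).
tmp=$(mktemp)
printf 'bla one\nnothing here\nbla two\n' > "$tmp"

# Each function reads stdin and writes stdout, so they chain like commands.
result=$(expand_blahs < "$tmp" | number_lines)
printf '%s\n' "$result"

rm -f "$tmp"
```

Because each stage is an ordinary filter, any one of them can be run on its own against a test file when something goes wrong.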

Command Line Interface – General Specification

I recommend reading a book on unix or Linux shell and command line usage, in order to learn basic usage and get a feeling for some advanced features. Then you can turn to reference documentation.

The usage of specific commands is described in their manual. man cat will show the manual of the cat command on your system. Manual pages are usually references, not tutorials, though they often contain examples. On Linux, cat --help shows a terse usage message (meant for quick perusal when you already know the fundamentals and want to find an option for a specific task).

The POSIX standard specifies a minimum set of commands, options and shell features that every unix system is supposed to support. Most current systems by and large support POSIX:2004 (also known as Single UNIX version 3 and the Open Group Base Specifications issue 6). GNU software (the utilities found on Linux) often has many extensions to this minimum set.

There are common conventions for command-line arguments. POSIX specifies utility conventions that most utilities follow, in particular:

Options consist of - followed by a single letter; -ab is shorthand for -a -b.
-- signifies the end of options. For example, in rm -- -a, -a is not an option but an operand, i.e. a file to act upon, so this command removes the file called -a.
A lone - stands for standard input, where an input file is expected. It stands for standard output where an output file is expected.
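The last two conventions can be tried out safely in a scratch directory. This is a small sketch; the directory and file names are assumptions for illustration.

```shell
# Work in a throwaway directory (hypothetical file names).
tmp=$(mktemp -d)
printf 'first\n' > "$tmp/file1"
printf 'second\n' > "$tmp/file2"

# "--" ends option parsing, so "-a" is treated as a file name.
( cd "$tmp" && : > ./-a && rm -- -a )
if [ ! -e "$tmp/-a" ]; then
    echo "file named -a was removed"
fi

# A lone "-" marks where standard input goes among the file operands.
combined=$(cat - "$tmp/file2" < "$tmp/file1")
printf '%s\n' "$combined"

rm -rf "$tmp"
```

Here cat reads file1 through stdin (supplied by the redirect) and then file2 by name, in the order the operands are given.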

GNU utilities and others also support “long options” of the form --name. Some utilities go against the general convention and take multi-letter options with a single leading dash: -name.

Redirection is a shell feature, so you'll find it in your shell's manual. <<< to use a string as standard input is a ksh extension, also supported by bash and zsh. As long as the shell supports it, it can be used on any command.
