Shell xargs – How to Make xargs Handle Spaces and Special Characters

Tags: shell, whitespace, xargs

I have a file that contains a list of names, e.g.:

Long Name One (001)
Long Name Two (201)
Long Name Three (123)
...

The names contain spaces and some special characters. I wanted to make directories out of these names, like this:

cat file | xargs -l1 mkdir

Instead, it creates one directory per space-separated word (Long, Name, One, Two, Three) rather than Long Name One (001), Long Name Two (201), Long Name Three (123).

How can I do that?

Best Answer

Use -d '\n' with your xargs command (the -d option is a GNU extension):

cat file | xargs -d '\n' -l1 mkdir

From the man page:

-d delim
              Input  items  are  terminated  by the specified character.  Quotes and backslash are not special; every
              character in the input is taken literally.  Disables the end-of-file string, which is treated like  any
              other  argument.   This can be used when the input consists of simply newline-separated items, although
              it is almost always better to design your program to use --null where this is possible.  The  specified
              delimiter  may be a single character, a C-style character escape such as \n, or an octal or hexadecimal
              escape code.  Octal and hexadecimal escape codes are understood as for the printf command.    Multibyte
              characters are not supported.

Example output:

$ ls
file

$ cat file
Long Name One (001)
Long Name Two (201)
Long Name Three (123)

$ cat file | xargs -d '\n' -l1 mkdir

$ ls -1
file
Long Name One (001)
Long Name Three (123)
Long Name Two (201)
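
The man page excerpt above recommends NUL-separated input where possible. As a minimal sketch of that variant (the scratch directory and the sample file are assumptions made only so the example is self-contained), the newlines can be converted to NUL bytes with tr before they reach xargs -0:

```shell
# Work in a scratch directory so the demo does not touch real data.
# (mktemp -d is not in POSIX, but it is widely available.)
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input file, mirroring the names in the question.
printf '%s\n' 'Long Name One (001)' 'Long Name Two (201)' 'Long Name Three (123)' > file

# Turn each newline terminator into a NUL byte, then let xargs split on NUL only.
# "--" keeps mkdir from treating a name that starts with "-" as an option.
tr '\n' '\0' < file | xargs -0 mkdir --

ls -1
```

This only helps when no name contains an embedded newline, which is the case for a one-name-per-line file like this one.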

Related Solutions

Shell – reason why ls does not have a --zero or -0 option

UPDATE (2014-02-02)

Thanks to @Anthon's persistence in following up on this missing feature, we now have a slightly more formal explanation of why it is absent, which echoes what I said in my original answer:

Re: [PATCH] ls: adding --zero/-z option, including tests

From:      Pádraig Brady
Subject:   Re: [PATCH] ls: adding --zero/-z option, including tests
Date:      Mon, 03 Feb 2014 15:27:31 +0000
Thanks a lot for the patch. If we were to do this then this is the interface we would use. However ls is really a tool for direct consumption by a human, and in that case further processing is less useful. For further processing, find(1) is more suited. That is well described in the first answer at the link above.

So I'd be 70:30 against adding this.

My original answer

This is partly my personal opinion, but I believe leaving that switch out of ls was a deliberate design decision. Note that the find command does have such a switch:

-print0
      True; print the full file name on the standard output, followed by a 
      null character (instead of the newline character that -print uses).  
      This allows file  names  that  contain  newlines or other types of white 
      space to be correctly interpreted by programs that process the find 
      output.  This option corresponds to the -0 option of xargs.

By leaving that switch out, the designers were implying that you should not be using ls output for anything other than human consumption. For downstream processing by other tools, you should be using find instead.

Ways to use find

If you're just looking for the alternative methods, they are listed in the linked article under "Doing it correctly: A quick summary". From that link, these are likely the three most common patterns:

Simple find -exec; unwieldy if COMMAND is large, and creates 1 process/file:
```
find . -exec COMMAND... {} \;
```
Simple find -exec with +, faster if multiple files are okay for COMMAND:
```
find . -exec COMMAND... {} \+
```
Use find and xargs with \0 separators
(nonstandard common extensions -print0 and -0. Works on GNU, *BSDs, busybox)
```
find . -print0 | xargs -0 COMMAND
```
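
To see that these patterns keep awkward names intact, here is a small sketch. The scratch directory and the three file names are made up for the demo, and `sh -c 'echo "$#"'` merely stands in for COMMAND by reporting how many arguments it received:

```shell
# Scratch directory with three files, including a name that contains a newline.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 'plain.txt' 'with space.txt' "$(printf 'new\nline.txt')"

# find -exec ... {} + passes the file names directly as arguments.
count_exec=$(find . -type f -exec sh -c 'echo "$#"' sh {} +)

# find -print0 | xargs -0 passes them through a NUL-separated stream.
count_xargs=$(find . -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

# Both approaches should report exactly three arguments.
echo "$count_exec $count_xargs"
```

A plain `find . | xargs` would split the same names on the space and the newline and report more than three arguments.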

Further evidence?

I found this blog post from Joey Hess' blog titled: "ls: the missing options". One of the interesting comments in this post:

The only obvious lack now is a -z option, which should make output filenames be NULL terminated for consumption by other programs. I think this would be easy to write, but I've been extremely busy IRL (moving lots of furniture) and didn't get to it. Any takers to write it?

Searching further, I found the following in the commit logs for one of the additional switches that Joey's blog post mentions, "new output format -j". It suggests that the blog post was poking fun at the notion of ever adding a -z switch to ls.

As to the other options, multiple people agree that -e is nearly almost useful, although none of us can quite find a reason to use it. My bug report neglected to mention that ls -eR is very buggy. -j is clearly a joke.

Bash Shell Script – Handling Whitespace and Special Characters

Always use double quotes around variable substitutions and command substitutions: `"$foo"`, `"$(foo)"`

If you use $foo unquoted, your script will choke on input or parameters (or command output, with $(foo)) containing whitespace or \[*?.

There, you can stop reading. Well, ok, here are a few more:

read: To read input line by line with the read builtin, use while IFS= read -r line; do … Plain read treats backslashes and whitespace specially.
xargs: Avoid xargs. If you must use xargs, make that xargs -0. Instead of find … | xargs, prefer find … -exec …. xargs treats whitespace and the characters \"' specially.

This answer applies to Bourne/POSIX-style shells (sh, ash, dash, bash, ksh, mksh, yash…). Zsh users should skip it and read the end of When is double-quoting necessary? instead. If you want the whole nitty-gritty, read the standard or your shell's manual.

Note that the explanations below contain a few approximations (statements that are true in most conditions but can be affected by the surrounding context or by configuration).

Why do I need to write `"$foo"`? What happens without the quotes?

$foo does not mean “take the value of the variable foo”. It means something much more complex:

1. First, take the value of the variable.
2. Field splitting: treat that value as a whitespace-separated list of fields, and build the resulting list. For example, if the variable contains foo * bar then the result of this step is the 3-element list foo, *, bar.
3. Filename generation: treat each field as a glob, i.e. as a wildcard pattern, and replace it by the list of file names that match this pattern. If the pattern doesn't match any files, it is left unmodified. In our example, this results in the list containing foo, followed by the list of files in the current directory, and finally bar. If the current directory is empty, the result is foo, *, bar.
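
The steps above can be observed directly. This is a minimal sketch that assumes the default IFS and an otherwise empty scratch directory; `count_args` is a hypothetical helper that just prints how many arguments it was given:

```shell
# Scratch directory containing exactly two files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt

var='foo * bar'

# Hypothetical helper: print the number of arguments received.
count_args() { echo "$#"; }

# Unquoted: split into foo, *, bar; then * is replaced by a.txt and b.txt.
count_args $var      # prints 4

# Quoted: the value is passed through as a single string.
count_args "$var"    # prints 1
```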

Note that the result is a list of strings. There are two contexts in shell syntax: list context and string context. Field splitting and filename generation only happen in list context, but that's most of the time. Double quotes delimit a string context: the whole double-quoted string is a single string, not to be split. (Exception: "$@" expands to the list of positional parameters, e.g. "$@" is equivalent to "$1" "$2" "$3" if there are three positional parameters. See What is the difference between $* and $@?)

The same happens to command substitution with $(foo) or with `foo`. On a side note, don't use `foo`: its quoting rules are weird and non-portable, and all modern shells support $(foo) which is absolutely equivalent except for having intuitive quoting rules.

The output of arithmetic substitution also undergoes the same expansions, but that isn't normally a concern as it only contains non-expandable characters (assuming IFS doesn't contain digits or -).

See When is double-quoting necessary? for more details about the cases when you can leave out the quotes.

Unless you mean for all this rigmarole to happen, just remember to always use double quotes around variable and command substitutions. Do take care: leaving out the quotes can lead not just to errors but to security holes.

How do I process a list of file names?

If you write myfiles="file1 file2", with spaces to separate the files, this can't work with file names containing spaces. Unix file names can contain any character other than / (which is always a directory separator) and null bytes (which you can't use in shell scripts with most shells).

Same problem with myfiles=*.txt; … process $myfiles. When you do this, the variable myfiles contains the 5-character string *.txt, and it's when you write $myfiles that the wildcard is expanded. This example will actually work, until you change your script to be myfiles="$someprefix*.txt"; … process $myfiles. If someprefix is set to final report, this won't work.

To process a list of any kind (such as file names), put it in an array. This requires mksh, ksh93, yash or bash (or zsh, which doesn't have all these quoting issues); a plain POSIX shell (such as ash or dash) doesn't have array variables.

myfiles=("$someprefix"*.txt)
process "${myfiles[@]}"

Ksh88 has array variables with a different assignment syntax: set -A myfiles "$someprefix"*.txt (see assignation variable under different ksh environment if you need ksh88/bash portability). Bourne/POSIX-style shells have a single array, the array of positional parameters "$@", which you set with set and which is local to a function:

set -- "$someprefix"*.txt
process -- "$@"
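
A short sketch of why the string approach fails and the positional-parameter approach works. The prefix value comes from the text above; the scratch directory and the three file names are assumptions for the demo:

```shell
# Scratch directory with three files whose names share a prefix containing a space.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
someprefix='final report'
touch "$someprefix 1.txt" "$someprefix 2.txt" "$someprefix 3.txt"

# Broken: the unquoted expansion is split at the space into the words
# "final" and "report*.txt", and neither pattern matches any file.
myfiles="$someprefix*.txt"
set -- $myfiles
broken_count=$#

# Working: the prefix stays quoted and only the *.txt part is a glob.
set -- "$someprefix"*.txt
good_count=$#

echo "broken: $broken_count words, good: $good_count files"
```

With three matching files, the broken variant yields two words that name no file, while the quoted variant yields the three real file names.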

What about file names that begin with `-`?

On a related note, keep in mind that file names can begin with a - (dash/minus), which most commands interpret as denoting an option. Some commands (like sh, set or sort) also accept options that start with +. If you have a file name that begins with a variable part, be sure to pass -- before it, as in the snippet above. This indicates to the command that it has reached the end of options, so anything after that is a file name even if it starts with - or +.

Alternatively, you can make sure that your file names begin with a character other than -. Absolute file names begin with /, and you can add ./ at the beginning of relative names. The following snippet turns the content of the variable f into a “safe” way of referring to the same file that's guaranteed not to start with - nor +.

case "$f" in -* | +*) f="./$f";; esac

On a final note on this topic, beware that some commands interpret - as meaning standard input or standard output, even after --. If you need to refer to an actual file named -, or if you're calling such a program and you don't want it to read from stdin or write to stdout, make sure to rewrite - as above. See What is the difference between "du -sh *" and "du -sh ./*"? for further discussion.
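
As an illustration, the same rewrite can be wrapped in a small function. `make_safe` is a hypothetical name chosen for this sketch, not something from the answer:

```shell
# Hypothetical helper: print a file name in a form that cannot be mistaken
# for an option, by prefixing ./ when it starts with - or +.
make_safe() {
  case "$1" in
    -* | +*) printf '%s\n' "./$1" ;;
    *)       printf '%s\n' "$1" ;;
  esac
}

make_safe '-rf'        # prints ./-rf
make_safe '-'          # prints ./-
make_safe 'notes.txt'  # prints notes.txt
```

Because a lone - also matches the -* pattern, the helper covers the standard input/output case mentioned above as well.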

How do I store a command in a variable?

“Command” can mean three things: a command name (the name as an executable, with or without full path, or the name of a function, builtin or alias), a command name with arguments, or a piece of shell code. There are accordingly different ways of storing them in a variable.

If you have a command name, just store it and use the variable with double quotes as usual.

command_path="$1"
…
"$command_path" --option --message="hello world"

If you have a command with arguments, the problem is the same as with a list of file names above: this is a list of strings, not a string. You can't just stuff the arguments into a single string with spaces in between, because if you do that you can't tell the difference between spaces that are part of arguments and spaces that separate arguments. If your shell has arrays, you can use them.

cmd=(/path/to/executable --option --message="hello world" --)
cmd=("${cmd[@]}" "$file1" "$file2")
"${cmd[@]}"

What if you're using a shell without arrays? You can still use the positional parameters, if you don't mind modifying them.

set -- /path/to/executable --option --message="hello world" --
set -- "$@" "$file1" "$file2"
"$@"
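
Here is a runnable sketch of the positional-parameter technique. `show_args` is a hypothetical stand-in for /path/to/executable that prints each argument it receives in angle brackets, and the two file names are made up:

```shell
# Hypothetical stand-in for the real executable: show each argument separately.
show_args() {
  for arg in "$@"; do
    printf '<%s>' "$arg"
  done
  printf '\n'
}

file1='first file.txt'
file2='second file.txt'

# Build the command as a list of words, then append the file names.
set -- show_args --option --message="hello world" --
set -- "$@" "$file1" "$file2"

# Run it: every word arrives as exactly one argument, spaces included.
output=$("$@")
echo "$output"
# prints <--option><--message=hello world><--><first file.txt><second file.txt>
```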

What if you need to store a complex shell command, e.g. with redirections, pipes, etc.? Or if you don't want to modify the positional parameters? Then you can build a string containing the command, and use the eval builtin.

code='/path/to/executable --option --message="hello world" -- /path/to/file1 | grep "interesting stuff"'
eval "$code"

Note the nested quotes in the definition of code: the single quotes '…' delimit a string literal, so that the value of the variable code is the string /path/to/executable --option --message="hello world" -- /path/to/file1 | grep "interesting stuff". The eval builtin tells the shell to parse the string passed as an argument as if it appeared in the script, so at that point the quotes and pipe are parsed, etc.

Using eval is tricky. Think carefully about what gets parsed when. In particular, you can't just stuff a file name into the code: you need to quote it, just like you would if it was in a source code file. There's no direct way to do that. Something like code="$code $filename" breaks if the file name contains any shell special character (spaces, $, ;, |, <, >, etc.). code="$code \"$filename\"" still breaks on the characters " $ \ and `. Even code="$code '$filename'" breaks if the file name contains a '. There are two solutions.

Add a layer of quotes around the file name. The easiest way to do that is to add single quotes around it, and replace single quotes by '\''.
```
quoted_filename=$(printf %s. "$filename" | sed "s/'/'\\\\''/g")
code="$code '${quoted_filename%.}'"
```
Keep the variable expansion inside the code, so that it's looked up when the code is evaluated, not when the code fragment is built. This is simpler but only works if the variable is still around with the same value at the time the code is executed, not e.g. if the code is built in a loop.
```
code="$code \"\$filename\""
```
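
The first solution can be checked with a deliberately nasty file name. This is only a sketch: the file name is invented, and printf stands in for whatever command the code would really run:

```shell
# A made-up file name containing a single quote, double quotes and a dollar sign.
filename="it's a \"test\" \$HOME.txt"

# Replace each ' by '\'' ; the trailing dot protects any trailing newlines
# from being stripped by the command substitution.
quoted_filename=$(printf %s. "$filename" | sed "s/'/'\\\\''/g")

# Embed the quoted name in the code string, wrapped in single quotes.
code="printf '%s\n' '${quoted_filename%.}'"

# When evaluated, the code sees the original file name as one argument.
result=$(eval "$code")
[ "$result" = "$filename" ] && echo "round trip ok"
```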

Finally, do you really need a variable containing code? The most natural way to give a name to a code block is to define a function.

What's up with `read`?

Without -r, read allows continuation lines — this is a single logical line of input:

hello \
world

read splits the input line into fields delimited by characters in $IFS (without -r, backslash also escapes those). For example, if the input is a line containing three words, then read first second third sets first to the first word of input, second to the second word and third to the third word. If there are more words, the last variable contains everything that's left after setting the preceding ones. Leading and trailing whitespace are trimmed.

Setting IFS to the empty string avoids any trimming. See Why is `while IFS= read` used so often, instead of `IFS=; while read..`? for a longer explanation.
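
The difference is easy to see on a line with surrounding spaces and a backslash. A minimal sketch, assuming the default IFS; the sample line is invented:

```shell
# Sample line with leading and trailing spaces and one backslash.
input='  leading spaces and a back\slash  '

# Plain read: whitespace is trimmed and the backslash is consumed.
plain=$(printf '%s\n' "$input" | { read line; printf '%s' "$line"; })

# IFS= read -r: the line is kept exactly as it was.
safe=$(printf '%s\n' "$input" | { IFS= read -r line; printf '%s' "$line"; })

printf '[%s]\n' "$plain"   # prints [leading spaces and a backslash]
printf '[%s]\n' "$safe"    # prints [  leading spaces and a back\slash  ]
```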

What's wrong with `xargs`?

The input format of xargs is whitespace-separated strings which can optionally be single- or double-quoted. No standard tool outputs this format.

The input to xargs -L1 or xargs -l is almost a list of lines, but not quite — if there is a space at the end of a line, the following line is a continuation line.

You can use xargs -0 where applicable (and where available: GNU (Linux, Cygwin), BusyBox, BSD, OSX, but it isn't in POSIX). That's safe, because null bytes can't appear in most data, in particular in file names. To produce a null-separated list of file names, use find … -print0 (or you can use find … -exec … as explained below).
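
A quick sketch of the difference, using one of the names from the question above. `sh -c 'echo "$#"'` is just a stand-in command that reports how many arguments xargs handed to it:

```shell
# Default input format: the line is split on whitespace into four arguments.
default_count=$(printf 'Long Name One (001)\n' | xargs sh -c 'echo "$#"' sh)

# NUL-terminated input with -0: the whole name is a single argument.
nul_count=$(printf 'Long Name One (001)\0' | xargs -0 sh -c 'echo "$#"' sh)

echo "default: $default_count, with -0: $nul_count"
# prints default: 4, with -0: 1
```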

How do I process files found by `find`?

find … -exec some_command a_parameter another_parameter {} +

some_command needs to be an external command, it can't be a shell function or alias. If you need to invoke a shell to process the files, call sh explicitly.

find … -exec sh -c '
  for x do
    … # process the file "$x"
  done
' find-sh {} +
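
As a concrete instance of the pattern above, this sketch counts the lines of each file found. The scratch directory, the two sample files and the report format are assumptions for the demo:

```shell
# Scratch directory with two files whose names contain spaces.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n' > 'a b.txt'
printf 'one\ntwo\n' > 'c d.txt'

# The inline script receives the file names as its positional parameters.
report=$(find . -type f -name '*.txt' -exec sh -c '
  for x do
    # Print "name:line-count"; tr strips the padding some wc versions add.
    printf "%s:%s\n" "${x##*/}" "$(wc -l < "$x" | tr -d " ")"
  done
' find-sh {} + | sort)

printf '%s\n' "$report"
```

The find-sh word becomes $0 of the inner shell, so it only shows up in error messages and is not treated as a file.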

I have some other question

Browse the quoting tag on this site, or shell or shell-script. (Click on “learn more…” to see some general tips and a hand-selected list of common questions.) If you've searched and you can't find an answer, ask away.
