Shell – How to Protect an IFS Character from Field Splitting

bourne-shellposixshell

In a POSIX sh, or in the Bourne shell (as in Solaris 10's /bin/sh), is it possible to have something like:

a='some var with spaces and a special space'
printf "%s\n" $a

And, with the default IFS, get:

some
var
with
spaces
and
a
special space

That is, protect the space between special and space by some combination of quoting or escaping?

The number of words in a isn't known beforehand, or I'd try something like:

a='some var with spaces and a special\ space'
printf "%s\n" "$a" | while read field1 field2 ...

The context is this bug reported in Cassandra, where OP tried to set an environment variable specifying options for the JVM:

export JVM_EXTRA_OPTS='-XX:OnOutOfMemoryError="echo oh_no"'

In the script executing Cassandra, which has to support POSIX sh and Solaris sh:

JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS"
#...
exec $NUMACTL "$JAVA" $JVM_OPTS $cassandra_parms -cp "$CLASSPATH" $props "$class"

IMO the only way out here is to use a script wrapping the echo oh_no command. Is there another way?

Best Answer

Not really.

One solution is to reserve a character as the field separator. Obviously it will not be possible to include that character, whatever it is, in an option. Tab and newline are obvious candidates, if the source language makes it easy to insert them. I would avoid multibyte characters if you want portability (e.g. dash and BusyBox don't support multibyte characters).

If you rely on IFS splitting, don't forget to turn off wildcard expansion with set -f.

tab=$(printf '\t')
IFS=$tab
set -f
exec java $JVM_EXTRA_OPTS …

Another approach is to introduce a quoting syntax. A very common quoting syntax is that a backslash protects the next character. The downside of using backslashes is that so many different tools use it as a quoting characters that it can sometimes be difficult to figure out how many backslashes you need.

set java
eval 'set -- "$@"' $(printf '%s\n' "$JVM_EXTRA_OPTS" | sed -e 's/[^ ]/\\&/g' -e 's/\\\\/\\/g') …
exec "$@"

Converting a standalone file

If you run the following command:

$ dos2unix <file>

The <file> will have all the ^M characters stripped. If you want to leave <file> intact, then simply run dos2unix like this:

$ dos2unix -n <file> <newfile>

Parsing output from a command

If you need to do them as part of a chain of commands via a pipe, you can use any number of tools such as tr, sed, awk, or perl to do this.

$ java -jar test.jar | tr -d '^M' >> test.log

sed

$ java -jar test.jar | sed 's/^M//g' >> test.log

awk

$ java -jar test.jar | awk 'sub(/^M/,"")' >> test.log

perl

$ java -jar test.jar | perl -p -e 's/^M//g' >> test.log

Typing ^M

When entering the ^M be sure to enter it in one of the following ways:

As Control + v + M and not Shift + 6 + M.
As a backslash r, i.e. (\r).
As an octal number (\015).
As a hexidecimal number (\x0D).

Why is this necessary?

The ^M is part of how end of lines are terminated on the Windows platform. Each end of line is terminated with a carriage return character followed by a linefeed character.

On Unix systems the end of line is terminated by just a linefeed character.

linefeed character = 0x0A in hex, also written as \n.
carriage return character = 0x0D in hex, also written as \r.

Examples

You can see these if you pipe the output to a tool such as od or hexdump. Here's a sample file with the line terminating carriage returns + linefeed characters.

$ cat sample.txt
hi there
bye there

You can see them with hexdump as \r + \n:

$ hexdump -c sample.txt 
0000000   h   i       t   h   e   r   e  \r  \n   b   y   e       t   h
0000010   e   r   e  \r  \n                                            
0000015

Or as their hexidecimal 0d + 0a:

$ hexdump -C sample.txt 
00000000  68 69 20 74 68 65 72 65  0d 0a 62 79 65 20 74 68  |hi there..bye th|
00000010  65 72 65 0d 0a                                    |ere..|
00000015

Running this through sed 's/\r//g':

$ sed 's/\r//g' sample.txt |hexdump -C
00000000  68 69 20 74 68 65 72 65  0a 62 79 65 20 74 68 65  |hi there.bye the|
00000010  72 65 0a                                          |re.|
00000013

You can see that sed has removed the 0d character.

Viewing files with ^M without converting?

Yes you can use vim to do this. You can either set the fileformat setting in vim, which will have the effect of converting the file like we were doing above, or you can change the fileformat in the vim view.

changing a file's format

:set fileformat=dos
:set fileformat=unix

You can use the shorthand notation too:

:set ff=dos
:set ff=unix

Alternatively you can just change the fileformat of the view. This approach is nondestructive:

:e ++ff=dos
:e ++ff=unix

Here you can see me opening our ^M file, sample.txt in vim:

ss of vim dos #1

Now I'm converting the fileformat in the view:

ss of vim dos #2

Here's what it looks like when converted to the unix fileformat:

ss of vim dos #3

Best Answer

Related Solutions

Bash – FS (Internal Field Separator) function as a single separator for multiple consecutive delimiter chars

Shell – Remove ^M character from log files