Bash – How to work with binary in bash, to copy bytes verbatim without any conversion

bash binary head

I am ambitiously trying to translate some C++ code into bash for a myriad of reasons.

This code reads and manipulates a file type specific to my sub-field that is written and structured completely in binary. My first binary-related task is to copy the first 988 bytes of the header, exactly as-is, and put them into an output file that I can continue writing to as I generate the rest of the information.

I am pretty sure that my current solution isn't working, and realistically I haven't figured out a good way to determine this. So even if it is actually written correctly, I need to know how I would test this to be sure!

This is what I'm doing right now:

hdr_988=`head -c 988 ${inputFile}`
echo -n "${hdr_988}" > ${output_hdr}
headInput=`head -c 988 ${inputTrack} | hexdump`
headOutput=`head -c 988 ${output_hdr} | hexdump`
if [ "${headInput}" != "${headOutput}" ]; then echo "output header was not written properly.  exiting.  please troubleshoot."; exit 1; fi

If I use hexdump/xxd to check out this part of the file, although I can't exactly read most of it, something seems wrong. And the code I have written in for comparison only tells me if two strings are identical, not if they are copied the way I want them to be.

Is there a better way to do this in bash? Can I simply read raw bytes and write them to a file verbatim (and ideally store them in variables as well)?

Best Answer

Dealing with binary data at a low level in shell scripts is generally a bad idea.

bash variables can't contain the byte 0. zsh is the only shell that can store that byte in its variables.

In any case, command arguments and environment variables cannot contain those bytes as they are NUL delimited strings passed to the execve system call.
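
To see this limitation concretely, here is a minimal sketch (the temporary file and all variable names are made up for illustration). It writes five bytes, two of them NUL, captures them in a variable, and compares the lengths. In bash the variable ends up holding only the three non-NUL characters.

```shell
#!/usr/bin/env bash
# Illustration only: tmp, data, file_len and var_len are hypothetical names.
tmp=$(mktemp)
printf 'a\0b\0c' > "$tmp"                 # 5 bytes: a NUL b NUL c

# The NUL bytes are lost when the output is captured in a variable.
# bash 4.4+ also warns on stderr, hence the redirection around the assignment.
{ data=$(cat "$tmp"); } 2>/dev/null

file_len=$(( $(wc -c < "$tmp") ))         # arithmetic strips any padding from wc
var_len=${#data}
echo "file: $file_len bytes, variable: $var_len characters"
rm -f "$tmp"
```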

Also note that:

var=`cmd`

or its modern form:

var=$(cmd)

strips all the trailing newline characters from the output of cmd. So, if that binary output ends in 0xa bytes, it will be mangled when stored in $var.
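
A small sketch of that stripping, and of the common workaround of appending a sentinel character and removing it afterwards. The file and variable names are hypothetical, and the workaround still does nothing for NUL bytes in bash.

```shell
#!/usr/bin/env bash
# Illustration only: all names below are hypothetical.
tmp=$(mktemp)
printf 'abc\n\n\n' > "$tmp"               # 6 bytes, ending in three 0x0a bytes

plain=$(cat "$tmp")                       # trailing newlines are stripped
guarded=$(cat "$tmp"; echo .)             # a sentinel "." protects them
guarded=${guarded%.}                      # remove the sentinel again

plain_len=$(( $(printf '%s' "$plain" | wc -c) ))
guarded_len=$(( $(printf '%s' "$guarded" | wc -c) ))
echo "plain: $plain_len bytes, guarded: $guarded_len bytes"
rm -f "$tmp"
```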

Here, you'd need to store the data encoded, for instance with xxd -p.

hdr_988=$(head -c 988 < "$inputFile" | xxd -p)
printf '%s\n' "$hdr_988" | xxd -p -r > "$output_hdr"

You could define helper functions like:

encode() {
  eval "$1"='$(
    shift
    "$@" | xxd -p  -c 0x7fffffff
    exit "${PIPESTATUS[0]}")'
}

decode() {
  printf %s "$1" | xxd -p -r
}

encode var cat /bin/ls &&
  decode "$var" | cmp - /bin/ls && echo OK

xxd -p output is not space efficient as it encodes 1 byte in 2 bytes, but it makes it easier to do manipulations with it (concatenating, extracting parts). base64 is one that encodes 3 bytes in 4, but is not as easy to work with.
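
As a rough illustration of why the hex form is convenient, the sketch below extracts a field and concatenates two pieces with ordinary bash string operations. The sample bytes and variable names are made up, and od piped through tr is used as a stand-in for xxd -p so that the example only relies on POSIX tools.

```shell
#!/usr/bin/env bash
# Illustration only: the sample bytes and variable names are hypothetical.
# od + tr stands in for "xxd -p": one long string of two hex digits per byte.
hex=$(printf 'HDR\0\001\002\003tail\n' | od -An -v -tx1 | tr -d '[:space:]')

# Byte offset N sits at hex offset 2*N; a field of L bytes is 2*L digits long.
offset=4 length=3
field=${hex:offset*2:length*2}            # bytes 4..6 of the data

magic=${hex:0:6}                          # first 3 bytes ("HDR")
joined=$magic$field                       # concatenating data is string concatenation
echo "field=$field joined=$joined total_bytes=$(( ${#hex} / 2 ))"
```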

The ksh93 shell has a builtin encoding format (uses base64) which you can use with its read and printf/print utilities:

typeset -b var # marked as "binary"/"base64-encoded"
IFS= read -rn 988 var < input
printf %B var > output

Now, if there's no transit via shell or environment variables, or command arguments, you should be OK as long as the utilities you use can handle any byte value. But note that for text utilities, most non-GNU implementations can't handle NUL bytes, and you'll want to fix the locale to C to avoid problems with multi-byte characters. A last character that is not a newline can also cause problems, as can very long lines (sequences of bytes between two 0xa bytes that are longer than LINE_MAX).
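
For instance, a text utility such as tr can be pointed at raw bytes once the locale is forced to C. The following is only a sketch with made-up sample data: it counts the bytes that have the high bit set, which in a UTF-8 locale might otherwise be interpreted as parts of multi-byte characters.

```shell
#!/usr/bin/env bash
# Illustration only: the sample data and variable names are hypothetical.
tmp=$(mktemp)
printf '\303\251abc\377\n' > "$tmp"       # 7 bytes, three of them >= 0x80

# In the C locale every byte is its own character, so tr works on raw bytes:
# delete everything except bytes 0x80..0xff and count what is left.
high_bytes=$(( $(LC_ALL=C tr -cd '\200-\377' < "$tmp" | wc -c) ))
echo "bytes with the high bit set: $high_bytes"
rm -f "$tmp"
```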

head -c where it's available should be OK here, as it's meant to work with bytes, and has no reason to treat the data as text. So

head -c 988 < input > output

should be OK. In practice, at least the GNU, FreeBSD and ksh93 builtin implementations are OK. POSIX doesn't specify the -c option, but says head should support lines of any length (not limited to LINE_MAX).
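
One way to check that such a copy is byte-exact is to compare it with cmp rather than comparing hexdump strings. This is a minimal sketch, assuming a head that supports -c; the input here is a temporary file of random bytes standing in for the real data, and all names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: input/output are temporary stand-ins for the real files.
input=$(mktemp) output=$(mktemp)
head -c 2000 /dev/urandom > "$input"      # random bytes, NUL and 0x0a included

head -c 988 < "$input" > "$output"        # copy the header, no variable involved

# cmp compares byte by byte; -s keeps it silent and reports via exit status.
if head -c 988 < "$input" | cmp -s - "$output"; then
  status=ok
else
  status=mismatch
fi
size=$(( $(wc -c < "$output") ))
echo "header copy: $status ($size bytes)"

# Later data can be appended without touching the header:
printf 'more data' >> "$output"
rm -f "$input" "$output"
```

Because the bytes never pass through a shell variable or a command argument, neither the NUL problem nor the trailing-newline problem applies.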

With zsh:

IFS= read -rk988 -u0 var < input &&
print -rn -- $var > output

Or:

var=$(head -c 988 < input && echo .) && var=${var%.}
print -rn -- $var > output

Even in zsh, if $var contains NUL bytes, you can pass it as argument to zsh builtins (like print above) or functions, but not as arguments to executables, as arguments passed to executables are NUL delimited strings, that's a kernel limitation, independent of the shell.

Related Solutions

Bash – In bash, how to convert 8 bytes to an unsigned int (64bit LE)

Bash is the wrong tool altogether. Shells are good at gluing bits and pieces together; text processing and arithmetic are provided on the side, and data processing isn't in their purview at all.

I'd go for Python over Perl, because Python has bignums right off the bat. Use struct.unpack to unpack the data.

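#!/usr/bin/env python3
import os, struct, sys
# 8192 little-endian unsigned 64-bit integers = 65536 bytes
fmt = "<" + "Q" * 8192
# Read bytes, not text; stdin must be a seekable file (e.g. ./script < file)
stdin = sys.stdin.buffer
header_bytes = stdin.read(65536)
header_ints = list(struct.unpack(fmt, header_bytes))
# Jump to 65536 bytes before the end of the file
stdin.seek(-65536, os.SEEK_END)
footer_bytes = stdin.read(65536)
footer_ints = list(struct.unpack(fmt, footer_bytes))
# your calculations here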

Here's my answer to the original question. The revised question doesn't have much to do with the original, which was about converting one 8-byte sequence into the 64-bit integer it represents in little-endian order.

I don't think bash has any built-in feature for this. The following snippet sets a to a string that is the hexadecimal representation of the number that corresponds to the bytes in the specified string in big endian order.

a=0x$(printf "%s" "$string" |
      od -t x1 -An |
      tr -dc '[:alnum:]')

For little-endian order, reverse the order of the bytes in the original string. In bash, and for a string of known length, you can do

a=0x$(printf "%s" "${string:7:1}${string:6:1}${string:5:1}${string:4:1}${string:3:1}${string:2:1}${string:1:1}${string:0:1}" |
      od -t x1 -An |
      tr -dc '[:alnum:]')
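
Since a bash string cannot hold NUL bytes, a variant that reads the bytes straight from a file may be more robust. The following is only a sketch: le_hex is a hypothetical helper, and the sample file is made up for the demonstration. It uses od to dump the bytes and awk to reverse their order.

```shell
#!/usr/bin/env bash
# Illustration only: le_hex and the sample file are hypothetical.
# Usage: le_hex FILE OFFSET COUNT
# Prints COUNT bytes starting at OFFSET as one hex number, treating the bytes
# as little-endian (the last byte read becomes the most significant digits).
le_hex() {
  od -An -v -tx1 -j "$2" -N "$3" "$1" |
    awk '{ for (i = 1; i <= NF; i++) s = $i s } END { print s }'
}

tmp=$(mktemp)
printf 'xx\001\002\003\004\005\006\007\010' > "$tmp"   # 2 junk bytes + 8 data bytes
a=0x$(le_hex "$tmp" 2 8)
echo "$a"                                 # 0x0807060504030201
echo "$(( a ))"                           # 578437695752307201
rm -f "$tmp"
```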

You can also get your platform's preferred endianness if your od supports 8-byte types.

a=0x$(printf "%s" "$string" |
      od -t x8 -An |
      tr -dc '[:alnum:]')

Whether you can do arithmetic on $a will depend on whether your bash supports 8-byte arithmetic. Even if it does, it'll treat it as a signed value.
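
A short sketch of what the signed treatment looks like, assuming a bash built with 64-bit arithmetic. The variable names are hypothetical. Values with the top bit set come out negative in arithmetic expansion, and the printf builtin can print the same bit pattern as an unsigned number.

```shell
#!/usr/bin/env bash
# Illustration only; assumes a bash built with 64-bit arithmetic.
a=0xffffffffffffffff                      # largest unsigned 64-bit value
signed=$(( a ))                           # wraps around to a negative number
unsigned=$(printf '%u' "$signed")         # reinterpret the same bits as unsigned
echo "signed=$signed unsigned=$unsigned"
```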

Alternatively, use Perl:

a=0x$(perl -e 'printf "%x", unpack "Q<", $ARGV[0]' "$string")

If your perl is compiled without 64-bit integer support, you'll need to break the bytes up.

a=0x$(perl -e 'printf "%x%08x\n", reverse unpack "L<L<", $ARGV[0]' "$string")

(Replace < by > for big-endian or remove it to get the platform endianness.)

How to use Bash to find 2 bytes in a binary file, increase their values, and replace

Testing with this file:

$ echo hello world > test.txt
$ echo -n $'\x1b\x1f' >> test.txt
$ echo whatever >> test.txt
$ hexdump -C test.txt 
00000000  68 65 6c 6c 6f 20 77 6f  72 6c 64 0a 1b 1f 77 68  |hello world...wh|
00000010  61 74 65 76 65 72 0a                              |atever.|
$ grep -a -b --only-matching $'\x1b\x1f' test.txt 
12:

So in this case the 1B 1F is at position 12.

Convert to integer (there is probably an easier way)

$ echo 'ibase=16; '`xxd -u -ps -l 2 -s 12 test.txt`  | bc
6943

And the reverse:

$ printf '%04X' 6943 | xxd -r -ps | hexdump -C
00000000  1b 1f                                             |..|
$ printf '%04X' 4242 | xxd -r -ps | hexdump -C
00000000  10 92                                             |..|

And putting it back in the file:

$ printf '%04X' 4242 | xxd -r -ps | dd of=test.txt bs=1 count=2 seek=12 conv=notrunc
2+0 records in
2+0 records out
2 bytes (2 B) copied, 5.0241e-05 s, 39.8 kB/s

Result:

$ hexdump -C test.txt
00000000  68 65 6c 6c 6f 20 77 6f  72 6c 64 0a 10 92 77 68  |hello world...wh|
00000010  61 74 65 76 65 72 0a                              |atever.|
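
The individual steps above could be combined into one script along these lines. This is a sketch under assumptions: the offset is taken as already known (12, as found above), the two bytes are read as one big-endian number, and od plus printf octal escapes are used in place of xxd and bc so that only POSIX tools and dd are needed. The variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: the file contents and offset mirror the walkthrough above;
# the variable names are hypothetical.
f=$(mktemp)
printf 'hello world\n\033\037whatever\n' > "$f"
offset=12

# Read two bytes at the offset as one big-endian hex number.
old=$(( 0x$(od -An -v -tx1 -j "$offset" -N 2 "$f" | tr -d '[:space:]') ))
new=$(( (old + 1) & 0xffff ))             # increment, wrapping at 16 bits

# Turn the number back into two bytes via octal escapes and patch in place.
# esc is deliberately used as a printf format so the escapes are interpreted.
esc=$(printf '\\%03o\\%03o' $(( new >> 8 )) $(( new & 0xff )))
printf "$esc" | dd of="$f" bs=1 seek="$offset" conv=notrunc 2>/dev/null

after=$(od -An -v -tx1 -j "$offset" -N 2 "$f" | tr -d '[:space:]')
size=$(( $(wc -c < "$f") ))
echo "old=$old new=$new bytes=$after size=$size"
rm -f "$f"
```

The conv=notrunc option matters here: without it dd would truncate the file after the two patched bytes.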
