Bash – Reading Character by Character with Read


I've been trying to use bash to read a file character by character.

After much trial and error, I have discovered that this works:

exec 4<file.txt
declare -i n
while read -r ch <&4; do
     n=0
     while [ ! $n -eq ${#ch} ]; do
           echo -n "${ch:$n:1}"
           (( n++ ))
     done
     echo ""
done

I.e., I can read it line by line and then loop through each line char by char.

Before doing this, I had tried:
exec 4<file.txt && while read -r -n1 ch <&4; do echo -n "$ch"; done
but it skipped all whitespace in the file.

Could you please explain why? Is there a way to make the second strategy (i.e. reading char by char with bash's read) work?

Best Answer

You need to remove whitespace characters from the $IFS parameter so that read stops stripping leading and trailing whitespace (with -n1, a whitespace character, if any, is both leading and trailing, so it is stripped):

while IFS= read -rn1 a; do printf %s "$a"; done
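To see the effect of $IFS in isolation, here is a minimal sketch that copies stdin to stdout one character at a time, once with the default $IFS and once with it emptied. The function names copy_default_ifs and copy_empty_ifs are made up for illustration, and the input deliberately contains no newline, since newlines are a separate issue.

```bash
#!/usr/bin/env bash
# Illustrative helpers (names are hypothetical, not from the answer).

# Default $IFS: a lone space or tab is stripped, so "$ch" comes back empty.
copy_default_ifs() {
    local ch
    while read -r -n1 ch; do
        printf %s "$ch"
    done
}

# Empty $IFS: nothing is stripped, so blanks survive.
copy_empty_ifs() {
    local ch
    while IFS= read -r -n1 ch; do
        printf %s "$ch"
    done
}

printf 'a b' | copy_default_ifs; echo    # prints "ab"
printf 'a b' | copy_empty_ifs; echo      # prints "a b"
```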

But even then, bash's read treats newline as its line delimiter and leaves $a empty when it reads one, which you can work around with:

while IFS= read -rn1 a; do printf %s "${a:-$'\n'}"; done

Though you could use IFS= read -d '' -rn1 instead or, even better, IFS= read -N1 (added in bash 4.1, copied from ksh93, where it was added in version o), which is the command to read exactly one character.
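As a rough sketch of the read -N1 approach (assuming bash 4.1 or later; count_chars is a name invented here, not something from bash), the loop below tallies every character it reads and reports how many of them were spaces and newlines, to show that neither is lost.

```bash
#!/usr/bin/env bash
# Hypothetical helper: tally characters read one at a time with read -N1.
# Prints "<total> <spaces> <newlines>".
count_chars() {
    local ch total=0 spaces=0 newlines=0
    while IFS= read -r -N1 ch; do
        total=$((total + 1))
        case $ch in
            ' ')   spaces=$((spaces + 1)) ;;
            $'\n') newlines=$((newlines + 1)) ;;
        esac
    done
    printf '%d %d %d\n' "$total" "$spaces" "$newlines"
}

printf 'a b\nc d\n' | count_chars    # prints "8 2 2"
```

The counters use $((...)) assignments rather than (( n++ )) so the loop also behaves under set -e, where an arithmetic command that evaluates to 0 counts as a failure.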

Note that bash's read can't cope with NUL characters. And ksh93 has the same issues as bash.

With zsh:

while read -ku0 a; do print -rn -- "$a"; done

(zsh can cope with NUL characters).

Note that those read -k/n/N read a number of characters, not bytes. So for multibyte characters, they may have to read multiple bytes until a full character is read. If the input contains invalid characters, you may end up with a variable that contains a sequence of bytes that doesn't form valid characters and which the shell may end up counting as several characters. For instance in a UTF-8 locale:

$ printf '\375\200\200\200\200ABC' | bash -c '
    IFS= read  -rN1 a; echo "${#a}"'

That \375 would introduce a 6-byte UTF-8 character. However, the 6th byte (A) above is invalid for a UTF-8 character. You still end up with \375\200\200\200\200A in $a, which bash counts as 6 characters, though the first 5 are not really characters, just 5 bytes that do not form part of any character.
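One way to observe the character/byte distinction is to compare ${#a} in the current locale with the same expansion under LC_ALL=C, where every byte counts as one character. This is only a sketch: char_len and byte_len are names made up here, and the character count depends on the locale the shell runs in.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for inspecting what read -N1 stored in a variable.

# Length in characters, as counted in the current locale.
char_len() { printf '%s\n' "${#1}"; }

# Length in bytes: the subshell body keeps the LC_ALL change from leaking.
byte_len() ( LC_ALL=C; printf '%s\n' "${#1}" )

# U+00E9 is encoded as two bytes (0xC3 0xA9) in UTF-8.
s=$'\xc3\xa9'
echo "chars: $(char_len "$s")  bytes: $(byte_len "$s")"
```

In a UTF-8 locale the first number should be 1, while the byte count is 2 in any locale.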
