bash – Bash Scripting and Large Files (Bug): Input with the Read Builtin from a Redirection Gives Unexpected Result

bash

I have a strange issue with large files and bash. This is the context:

  • I have a large file: 75G and 400,000,000+ lines (it is a log file, my bad, I let it grow).
  • The first 10 characters of each line are a time stamp in the format YYYY-MM-DD.
  • I want to split that file: one file per day.

I tried with the following script that did not work. My question is about this script not working, not alternative solutions.

while read line; do
  new_file=${line:0:10}_file.log
  echo "$line" >> $new_file
done < file.log

After debugging, I found the problem in the new_file variable. This script:

while read line; do
  new_file=${line:0:10}_file.log
  echo $new_file
done < file.log | uniq -c

gives the result below (I replaced some characters with x to keep the data confidential; the other characters are real). Notice the dh and the shorter strings:

...
  27402 2011-xx-x4
  27262 2011-xx-x5
  22514 2011-xx-x6
  17908 2011-xx-x7
...
3227382 2011-xx-x9
4474604 2011-xx-x0
1557680 2011-xx-x1
      1 2011-xx-x2
      3 2011-xx-x1
...
     12 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1
      1 208--
      1 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1    
...

It is not a problem with the format of my file. The command cut -c 1-10 file.log | uniq -c gives only valid time stamps. Interestingly, the corresponding part of the output above becomes, with cut ... | uniq -c:

3227382 2011-xx-x9
4474604 2011-xx-x0
5722027 2011-xx-x1

We can see that, after the group with the uniq count 4474604, my initial script started to fail.

Did I hit a limit in bash that I do not know about, did I find a bug in bash (it seems unlikely), or have I done something wrong?

Update:

The problem happens after reading 2 GB of the file. It seems read and redirection do not like files larger than 2 GB. But I am still searching for a more precise explanation.

Update2:

It definitely looks like a bug. It can be reproduced with:

yes "0123456789abcdefghijklmnopqrs" | head -n 100000000 > file
while read line; do file=${line:0:10}; echo $file; done < file | uniq -c

but this works fine as a workaround (it seems I have found a useful use of cat):

cat file | while read line; do file=${line:0:10}; echo $file; done | uniq -c 
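For reference, a quick bit of shell arithmetic (my own back-of-the-envelope check, not part of the original report) shows that the generated test file is large enough to push the read offset past the 32-bit limit:

echo $(( 100000000 * 30 ))    # 3000000000 bytes: each generated line is 29 characters plus a newline
echo $(( 2**31 - 1 ))         # 2147483647: the largest offset a 32-bit signed integer can hold
echo $(( (2**31 - 1) / 30 ))  # 71582788: roughly the line at which the offset overflows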

A bug has been filed with GNU and Debian. The affected version is bash 4.1.5 on Debian Squeeze 6.0.2 and 6.0.4.

echo ${BASH_VERSINFO[@]}
4 1 5 1 release x86_64-pc-linux-gnu

Update3:

Thanks to Andreas Schwab, who reacted quickly to my bug report, this is the patch that fixes this misbehavior. The impacted file is lib/sh/zread.c, as Gilles pointed out below:

diff --git a/lib/sh/zread.c b/lib/sh/zread.c
index 0fd1199..3731a41 100644
--- a/lib/sh/zread.c
+++ b/lib/sh/zread.c
@@ -161,7 +161,7 @@ zsyncfd (fd)
      int fd;
 {
   off_t off;
-  int r;
+  off_t r;
 
   off = lused - lind;
   r = 0;

The r variable is used to hold the return value of lseek. As lseek returns the offset from the beginning of the file, when it is over 2 GB, the int value is negative, which causes the test if (r >= 0) to fail where it should have succeeded.
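To make the wraparound concrete, here is a small bash sketch of my own (the numbers are illustrative, not taken from the log) that reinterprets an offset just past 2 GB the way a 32-bit signed int would:

off=3000000000                            # an lseek return value a little past 2 GB
r=$(( off & 0xFFFFFFFF ))                 # keep only the low 32 bits, as the int r would
(( r >= 2**31 )) && r=$(( r - 2**32 ))    # the high bit becomes the sign bit
echo "lseek returned $off, but r holds $r"   # prints a negative number, so 'if (r >= 0)' fails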

Best Answer

You've found a bug in bash, of sorts. It's a known bug with a known fix.

Programs represent an offset in a file as a variable in some integer type with a finite size. In the old days, everyone used int for just about everything, and the int type was limited to 32 bits, including the sign bit, so it could store values from -2147483648 to 2147483647. Nowadays there are different type names for different things, including off_t for an offset in a file.

By default, off_t is a 32-bit type on a 32-bit platform (allowing up to 2GB), and a 64-bit type on a 64-bit platform (allowing up to 8EB). However, it's common to compile programs with the LARGEFILE option, which switches the type off_t to being 64 bits wide and makes the program call suitable implementations of functions such as lseek.
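If you want to check what your own platform and bash binary look like, the following standard commands should tell you (a sketch; the output varies by system):

getconf LONG_BIT            # 32 or 64: the native word size of the platform
file "$(command -v bash)"   # reports whether the bash binary itself is a 32-bit or 64-bit executable
getconf LFS_CFLAGS          # compiler flags needed for a 64-bit off_t; usually empty on 64-bit systems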

It appears that you're running bash on a 32-bit platform and your bash binary is not compiled with large file support. Now, when you read a line from a regular file, bash uses an internal buffer to read characters in batches for performance (for more details, see the source in builtins/read.def). When the line is complete, bash calls lseek to rewind the file offset back to the position of the end of the line, in case some other program cared about the position in that file. The call to lseek happens in the zsyncfd function in lib/sh/zread.c.
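The effect of that lseek is easy to observe on a small seekable file: bash reads ahead into its buffer, then seeks back so the descriptor is left just after the line it consumed, and the next reader of the same descriptor picks up where that line ended. A quick sketch (demo.txt is just a throwaway example file):

printf 'line1\nline2\nline3\n' > demo.txt
{ IFS= read -r first; echo "read got: $first"; cat; } < demo.txt
# read got: line1
# line2
# line3   <- cat sees the remaining lines because bash rewound the offset it had buffered ahead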

I haven't read the source in much detail, but I surmise that something is not happening smoothly at the point of transition when the absolute offset is negative. So bash ends up reading at the wrong offsets when it refills its buffer, after it's passed the 2GB mark.

If my conclusion is wrong and your bash is in fact running on a 64-bit platform or compiled with largefile support, that is definitely a bug. Please report it to your distribution or upstream.

A shell is not the right tool to process such large files anyway. It's going to be slow. Use sed if possible, otherwise awk.
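For instance, a single awk pass can do the split that the original script attempts (shown only as a sketch, since the question deliberately asks about the bash script rather than alternatives):

# One pass over the log; each line is written to a file named after its first
# 10 characters (e.g. 2011-xx-x4_file.log), and only one file per day stays open.
awk '{ print > (substr($0, 1, 10) "_file.log") }' file.log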
