bash – Bash Scripting and Large Files (Bug): Input with the Read Builtin from a Redirection Gives Unexpected Result

bash

I have a strange issue with large files and bash. This is the context:

  • I have a large file: 75G and 400,000,000+ lines (it is a log file, my bad, I let it grow).
  • The first 10 characters of each line are a time stamp in the format YYYY-MM-DD.
  • I want to split that file: one file per day.

I tried with the following script that did not work. My question is about this script not working, not alternative solutions.

while read line; do
  new_file=${line:0:10}_file.log
  echo "$line" >> $new_file
done < file.log

After debugging, I found the problem in the new_file variable. This script:

while read line; do
  new_file=${line:0:10}_file.log
  echo $new_file
done < file.log | uniq -c

gives the result below (I replaced some characters with x to keep the data confidential; the other characters are real). Notice the dh and the shorter strings:

...
  27402 2011-xx-x4
  27262 2011-xx-x5
  22514 2011-xx-x6
  17908 2011-xx-x7
...
3227382 2011-xx-x9
4474604 2011-xx-x0
1557680 2011-xx-x1
      1 2011-xx-x2
      3 2011-xx-x1
...
     12 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1
      1 208--
      1 2011-xx-x1
      1 2011-xx-dh
      1 2011-xx-x1    
...

It is not a problem with the format of my file. The command cut -c 1-10 file.log | uniq -c gives only valid time stamps. Interestingly, the corresponding part of the output above becomes, with cut ... | uniq -c:

3227382 2011-xx-x9
4474604 2011-xx-x0
5722027 2011-xx-x1

We can see that, after the group with the uniq count 4474604, my initial script started to fail.

Did I hit a limit in bash that I do not know about, did I find a bug in bash (it seems unlikely), or have I done something wrong?

Update:

The problem happens after reading 2 GB of the file. It seems read and redirection do not like files larger than 2 GB. But I am still searching for a more precise explanation.

Update2:

It definitely looks like a bug. It can be reproduced with:

yes "0123456789abcdefghijklmnopqrs" | head -n 100000000 > file
while read line; do file=${line:0:10}; echo $file; done < file | uniq -c

but this works fine as a workaround (it seems I have found a useful use of cat):

cat file | while read line; do file=${line:0:10}; echo $file; done | uniq -c 
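For reference, a quick bit of shell arithmetic (my own back-of-the-envelope check, not part of the original report) shows that the generated test file is large enough to push the read offset past the 32-bit limit:

echo $(( 100000000 * 30 ))    # 3000000000 bytes: each generated line is 29 characters plus a newline
echo $(( 2**31 - 1 ))         # 2147483647: the largest offset a 32-bit signed integer can hold
echo $(( (2**31 - 1) / 30 ))  # 71582788: roughly the line at which the offset overflows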

A bug has been filed with GNU and Debian. The affected version is bash 4.1.5 on Debian Squeeze 6.0.2 and 6.0.4.

echo ${BASH_VERSINFO[@]}
4 1 5 1 release x86_64-pc-linux-gnu

Update3:

Thanks to Andreas Schwab, who reacted quickly to my bug report, this is the patch that fixes this misbehavior. The impacted file is lib/sh/zread.c, as Gilles pointed out below:

diff --git a/lib/sh/zread.c b/lib/sh/zread.c
index 0fd1199..3731a41 100644
--- a/lib/sh/zread.c
+++ b/lib/sh/zread.c
@@ -161,7 +161,7 @@ zsyncfd (fd)
      int fd;
 {
   off_t off;
-  int r;
+  off_t r;
 
   off = lused - lind;
   r = 0;

The r variable is used to hold the return value of lseek. As lseek returns the offset from the beginning of the file, when it is over 2 GB, the int value is negative, which causes the test if (r >= 0) to fail where it should have succeeded.
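To make the wraparound concrete, here is a small bash sketch of my own (the numbers are illustrative, not taken from the log) that reinterprets an offset just past 2 GB the way a 32-bit signed int would:

off=3000000000                            # an lseek return value a little past 2 GB
r=$(( off & 0xFFFFFFFF ))                 # keep only the low 32 bits, as the int r would
(( r >= 2**31 )) && r=$(( r - 2**32 ))    # the high bit becomes the sign bit
echo "lseek returned $off, but r holds $r"   # prints a negative number, so 'if (r >= 0)' fails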

Best Answer

You've found a bug in bash, of sorts. It's a known bug with a known fix.

Programs represent an offset in a file as a variable in some integer type with a finite size. In the old days, everyone used int for just about everything, and the int type was limited to 32 bits, including the sign bit, so it could store values from -2147483648 to 2147483647. Nowadays there are different type names for different things, including off_t for an offset in a file.

By default, off_t is a 32-bit type on a 32-bit platform (allowing up to 2GB), and a 64-bit type on a 64-bit platform (allowing up to 8EB). However, it's common to compile programs with the LARGEFILE option, which switches the type off_t to being 64 bits wide and makes the program call suitable implementations of functions such as lseek.
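If you want to check what your own platform and bash binary look like, the following standard commands should tell you (a sketch; the output varies by system):

getconf LONG_BIT            # 32 or 64: the native word size of the platform
file "$(command -v bash)"   # reports whether the bash binary itself is a 32-bit or 64-bit executable
getconf LFS_CFLAGS          # compiler flags needed for a 64-bit off_t; usually empty on 64-bit systems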

It appears that you're running bash on a 32-bit platform and your bash binary is not compiled with large file support. Now, when you read a line from a regular file, bash uses an internal buffer to read characters in batches for performance (for more details, see the source in builtins/read.def). When the line is complete, bash calls lseek to rewind the file offset back to the position of the end of the line, in case some other program cared about the position in that file. The call to lseek happens in the zsyncfd function in lib/sh/zread.c.
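The effect of that lseek is easy to observe on a small seekable file: bash reads ahead into its buffer, then seeks back so the descriptor is left just after the line it consumed, and the next reader of the same descriptor picks up where that line ended. A quick sketch (demo.txt is just a throwaway example file):

printf 'line1\nline2\nline3\n' > demo.txt
{ IFS= read -r first; echo "read got: $first"; cat; } < demo.txt
# read got: line1
# line2
# line3   <- cat sees the remaining lines because bash rewound the offset it had buffered ahead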

I haven't read the source in much detail, but I surmise that something is not happening smoothly at the point of transition when the absolute offset is negative. So bash ends up reading at the wrong offsets when it refills its buffer, after it's passed the 2GB mark.

If my conclusion is wrong and your bash is in fact running on a 64-bit platform or compiled with largefile support, that is definitely a bug. Please report it to your distribution or upstream.

A shell is not the right tool to process such large files anyway. It's going to be slow. Use sed if possible, otherwise awk.
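For instance, a single awk pass can do the split that the original script attempts (shown only as a sketch, since the question deliberately asks about the bash script rather than alternatives):

# One pass over the log; each line is written to a file named after its first
# 10 characters (e.g. 2011-xx-x4_file.log), and only one file per day stays open.
awk '{ print > (substr($0, 1, 10) "_file.log") }' file.log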
