Shell – How to (Memory Limited) > grep -F -f file_A file_B >> output.txt

Tags: grep, linux, scripting, shell-script, text-processing

file_A (~500MB, 1.6M lines) consists entirely of equal-length search terms, one per line, not sorted.

file_B consists entirely of equal-length text lines, one per line, also not sorted.

I've been able to run "grep -F -f file_A file_B >> output.txt" against any size of file_B without problems on a box with 52GB of RAM. The problem is that I'm now limited to 4GB of RAM, and file_A is now too large for this to run without exhausting available memory.

Short of manually chopping file_A into smaller pieces, is there an easy way to script this so it greps for the first 1000 lines of file_A, then, when that's finished, automatically greps for lines 1001-2000, etc., until it has worked through all of file_A?

Best Answer

Loop through file_A in chunks, feeding each chunk to the same grep invocation on stdin as the pattern file (-f -); adjust 1000 to fit your available memory:

nlines=$(wc -l < file_A)   # total number of search terms
chunk=1000                 # terms per grep invocation; tune to available memory
for ((i = 1; i <= nlines; i += chunk))
do
  # print lines i .. i+chunk-1 of file_A and use them as grep's pattern file
  sed -n "$i,+$((chunk - 1))p" file_A | grep -F -f - file_B
done > output
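Two things worth noting about the chunked approach. First, unlike a single grep over the whole pattern file, it can print a line of file_B more than once if that line matches terms in several different chunks; pipe the result through sort -u (or awk '!seen[$0]++' to preserve order) if you need each matching line only once. Second, the sed in the loop re-reads file_A from the top on every iteration. If you'd rather read file_A just once, an equivalent approach splits it into temporary pattern files up front. This is a sketch under the same assumptions; the chunk_ prefix and the temporary directory are arbitrary choices:

tmpdir=$(mktemp -d)
# split file_A once into 1000-line pattern files instead of re-reading it with sed
split -l 1000 file_A "$tmpdir/chunk_"
for f in "$tmpdir"/chunk_*
do
  # each pass loads only ~1000 patterns into grep's memory
  grep -F -f "$f" file_B
done > output
rm -r "$tmpdir"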