Shell Script – Efficiently Grep Interval of Sorted File

performanceshell-scripttext processing

My file has millions of lines, resides in memory /dev/shm/tmp.file, is accessed by multiple threads, looks like this

831092,25a1bd66f2eec71aa2f0a8bb3d,/path/to/a/file
4324,8d83c29e4d8c71bd66f1bd66fs,/path/to/another/file
...

and is sorted by the part after the second , with sort -t , -k3.
In general each line has the shape [0-9]*,[0-9a-z]*,.* and file paths can contain any characters except \0 or \n.

I need to extract the lines of all files that reside within a given directory as quickly as possible and without making an additional copy. Since the file is sorted that way, the lines I am looking for are an uninterrupted chunk of the file.

Currently I use grep -F ',<directory>' /dev/shm/tmp.file but I know it would be much quicker to do a binary search for the first hit and then expand the chunk line by line or with another binary search without reading in the entire file for each new line. However, this has to be integrated into a bash script, and I found no way to do something like lseek in the bash.

There is sgrep but it requires the complete lines to be sorted.

How can I extract all matches with ',<directory>' quicker than grep -F?

Edit: The input /dev/shm/tmp.file is only there to do this kind of extraction. Hence, pre-processing it in some way to make the job easier is an option.

Edit: a.b sorting between a and a/b is not an issue since all subdirectories should be included in the chunk.

Best Answer

If you changed 831092,25a1bd66f2eec71aa2f0a8bb3d,/path/to/a/file to /path/to/a/file,831092,25a1bd66f2eec71aa2f0a8bb3d

You could do it with:

look /path/to/ /dev/shm/tmp.file

look is a traditional Unix utility from the 70s, not specified by POSIX but fairly common. On Debian and derivatives, you'll find one in the bsdmainutils package, there's also one in util-linux (also copied from BSD, not on the Debian package by the same name).

look mmap()s the file and does a binary search.

However note that the Debian implementation reverts to a basic linear search a la grep unless you pass the -b option (sigh). So, on Debian or derivative, you'll want:

look -b /path/to/ /dev/shm/tmp.file

Also note that some implementations have a limit on the size of the file they can handle (see corresponding bug with patch for Debian's)

Related Question