Text Processing – Binary Search in a Sorted Text File

searchtext processing

I have a big sorted file with billions of lines of variable lengths. Given a new line I would like to know which byte number it would get if it had been included in the sorted file.

Example

a\n
c\n
d\n
f\n
g\n

Given the input 'foo' I would get the output 9.

This is easy to do by simply going through the whole file, but being billions of lines of variable lengths it would be faster to do a binary search.

Does such a text processing tool already exist?

Edit:

It does now: https://gitlab.com/ole.tange/tangetools/blob/master/2search

Best Answer

I'm not aware of some standard tool doing this. However you can write your own. For example the following ruby script should do the job.

file, key = ARGV.shift, ARGV.shift
min, max = 0, File.size(file)

File.open(file) do |f|
  while max-min>1 do
    middle = (max+min)/2
    f.seek middle
    f.readline
    if f.eof? or f.readline>=key
      max = middle
    else
      min = middle
    end
  end
  f.seek max
  f.readline
  p f.pos+1
end

It's a bit tricky because after the seek you are usually in the middle of some line and therefore need to do one readline to get to the beginning of the following line, which you can read and compare to your key.

Related Question