Bash – Start Reading a File from an Arbitrary Byte Count Offset

bash, filesystems, text processing

I want to locate a date which is somewhere in an 8 GB log (text).

Can I somehow bypass a full sequential read and instead do binary splits of the file (by size), or somehow navigate the filesystem inodes (which I know very little about), to start reading from each split point until I find a suitable offset from which to start my text search for a line containing the date?

tail's read of the last line doesn't use a normal sequential read, so I wonder whether this facility is somehow available in bash, or whether I would need to use Python or C/C++. I am specifically interested in a bash option.

Best Answer

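for (( block = 0; block < 16; block += 1 ))
do
    echo "$block"
    # iflag=skip_bytes (GNU dd) makes skip= a byte count; without it, skip= counts bs-sized blocks.
    dd if=INPUTFILE iflag=skip_bytes skip=$((block*512))MB bs=64 count=1 status=noxfer 2> /dev/null | \
        head -n 1
done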

This creates no temporary split files. On each iteration it skips block × 512 MB of data, reads 64 bytes from that position, and limits the output to the first line within those 64 bytes. Because the offset usually lands in the middle of a line, that first line is typically a fragment.

You might want to adjust 64 to whatever you think you need.
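To turn the sampling idea into an actual search, one option is to bisect on byte offsets. The sketch below is not part of the original answer. It assumes bash, GNU tail (which seeks to the requested offset on regular files rather than reading from the start), and a log whose lines are sorted by a leading timestamp that compares correctly as a plain string (for example 2024-03-15). The function names first_full_line_after and find_offset, and the 64 KiB default window, are hypothetical choices for illustration.

```shell
#!/usr/bin/env bash

# Print the first complete line that starts after byte offset $2 of file $1.
# The line we land in is treated as a fragment and skipped.
first_full_line_after() {
    local file=$1 offset=$2
    if (( offset == 0 )); then
        head -n 1 -- "$file"
    else
        # tail -c +N starts output at byte N (1-based); sed prints line 2 and quits.
        tail -c +"$((offset + 1))" -- "$file" | sed -n '2{p;q;}'
    fi
}

# Bisect on byte offsets. Prints an offset that lies before the first line
# whose leading key is >= $2 and at most roughly $3 bytes (plus one line)
# away from it.
find_offset() {
    local file=$1 target=$2 window=${3:-65536}
    local lo=0 hi mid line key
    hi=$(wc -c < "$file")
    while (( hi - lo > window )); do
        mid=$(( (lo + hi) / 2 ))
        line=$(first_full_line_after "$file" "$mid")
        key=${line:0:${#target}}
        if [[ -n $line && $key < $target ]]; then
            lo=$mid      # still before the target: move the lower bound up
        else
            hi=$mid      # at/after the target, or past the last line
        fi
    done
    echo "$lo"
}
```

Once an offset is found, the remaining sequential search is short, for example: offset=$(find_offset INPUTFILE 2024-03-15); tail -c +"$((offset + 1))" INPUTFILE | grep -m 1 '^2024-03-15'. Each probe costs one seek and a few bytes of reading, so an 8 GB file needs only a couple of dozen probes.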

Related Solutions

Reading Lines from a File in Bash – for vs. while Loop Comparison

The for loop is fine here. But note that this is because the file contains machine names, which do not contain any whitespace characters or globbing characters. for x in $(cat file); do … does not work to iterate over the lines of file in general, because the shell first splits the output from the command cat file anywhere there is whitespace, and then treats each word as a glob pattern so \[?* are further expanded. You can make for x in $(cat file) safe if you work on it:

set -f
IFS='
'
for x in $(cat file); do …

Related reading: Looping through files with spaces in the names?; How can I read line by line from a variable in bash?; Why is while IFS= read used so often, instead of IFS=; while read..? Note that when using while read, the safe syntax to read lines is while IFS= read -r line; do ….
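The difference between the two loop styles is easy to see on a small file. The following is a minimal sketch, assuming bash; the function names lines_with_for and lines_with_while are made up for illustration. The for variant runs in a subshell so that set -f and the IFS change do not leak into the calling shell.

```shell
#!/usr/bin/env bash

# "Safe" for loop: globbing disabled, splitting on newlines only.
# The body is a subshell ( ... ), so set -f and IFS are restored on return.
lines_with_for() (
    set -f
    IFS='
'
    for x in $(cat "$1"); do
        printf '[%s]\n' "$x"
    done
)

# while/read loop: keeps leading whitespace, backslashes, and empty lines.
lines_with_while() {
    local line
    while IFS= read -r line; do
        printf '[%s]\n' "$line"
    done < "$1"
}
```

For a file containing the three lines "a b", an empty line, and "*", lines_with_for prints [a b] and [*], while lines_with_while also prints [] for the empty line: newline is an IFS whitespace character, so consecutive newlines collapse during word splitting.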

Now let's turn to what goes wrong with your while read attempt. The redirection from the server list file applies to the whole loop. So when ssh runs, its standard input comes from that file. The ssh client can't know when the remote application might want to read from its standard input. So as soon as the ssh client notices some input, it sends that input to the remote side. The ssh server there is then ready to feed that input to the remote command, should it want it. In your case, the remote command never reads any input, so the data ends up discarded, but the client side doesn't know anything about that. Your attempt with echo worked because echo never reads any input, it leaves its standard input alone.

There are a few ways you can avoid this. You can tell ssh not to read from standard input, with the -n option.

while read server; do
  ssh -n $server "uname -a"
done < /home/kenny/list_of_servers.txt

The -n option in fact tells ssh to redirect its input from /dev/null. You can do that at the shell level, and it'll work for any command.

while read server; do
  ssh $server "uname -a" </dev/null
done < /home/kenny/list_of_servers.txt

A tempting method to avoid ssh's input coming from the file is to put the redirection on the read command: while read server </home/kenny/list_of_servers.txt; do …. This will not work, because it causes the file to be opened again each time the read command is executed (so it would read the first line of the file over and over). The redirection needs to be on the whole while loop so that the file is opened once for the duration of the loop.

The general solution is to provide the input to the loop on a file descriptor other than standard input. The shell has constructs to ferry input and output from one descriptor number to another. Here, we open the file on file descriptor 3, and redirect the read command's standard input from file descriptor 3. The ssh client ignores open non-standard descriptors, so all is well.

while read server <&3; do
  ssh $server "uname -a"
done 3</home/kenny/list_of_servers.txt

In bash, the read command has a specific option to read from a different file descriptor, so you can write read -u3 server.
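The effect can be reproduced without any remote hosts. In the sketch below, greedy is a hypothetical stand-in for ssh: it swallows all of its standard input and then reports its argument. The function names and the list file are assumptions for illustration, and read -u3 requires bash.

```shell
#!/usr/bin/env bash

# Stand-in for ssh: consumes everything on stdin, then reports its argument.
greedy() {
    cat > /dev/null
    printf 'visited %s\n' "$1"
}

# List on stdin: greedy eats the rest of the list during the first iteration.
loop_on_stdin() {
    local server
    while IFS= read -r server; do
        greedy "$server"
    done < "$1"
}

# List on fd 3: greedy only sees the caller's stdin, so the list survives.
loop_on_fd3() {
    local server
    while IFS= read -r -u3 server; do
        greedy "$server"
    done 3< "$1"
}
```

With a list file holding a, b, and c, loop_on_stdin visits only a, whereas loop_on_fd3 visits all three entries (run it with stdin from /dev/null or a terminal, since greedy still reads the caller's standard input).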

Get line number from byte offset

In your example, byte number 8 is the second newline, not the 0 on the next line.

The following will give you the number of full lines after $b bytes:

$ dd if=data.in bs=1 count="$b" | wc -l

It will report 2 with b set to 8 and it will report 1 with b set to 7.

The dd utility, the way it's used here, will read from the file data.in, and will read $b blocks of size 1 byte.

As "icarus" rightly points out in the comments below, using bs=1 is inefficient. It's more efficient, in this particular case, to swap bs and count:

$ dd if=data.in bs="$b" count=1 | wc -l

This will have the same effect as the first dd command, but will read only one block of $b bytes.

The wc utility counts newlines, and a "line" in Unix is always terminated by a newline. So the above command will still say 2 if you set b to anything lower than 12 (the following newline). The result you are looking for is therefore whatever number the above pipeline reports, plus 1.

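This will obviously also count any newlines in the binary blob part of your file that precedes the ASCII text. If you know where the ASCII part starts, you can add skip="$offset" to the dd command, where $offset is the number of bytes to skip into the file. Note that skip counts blocks of bs bytes, so it acts as a byte offset only with the bs=1 form; with GNU dd, adding iflag=skip_bytes makes skip a byte count regardless of bs.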
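The same calculation can be wrapped in a small helper. This sketch swaps dd for tail -c and head -c, which take byte counts directly and so avoid the block-size arithmetic; head -c is available in GNU and BSD userlands but is not required by POSIX. The function name line_number_at_byte and its optional third argument are hypothetical.

```shell
#!/usr/bin/env bash

# Print the 1-based line number of the byte that follows the first $2 bytes
# of file $1, optionally ignoring the first $3 bytes of the file
# (e.g. a binary header). The line number is relative to that starting point.
line_number_at_byte() {
    local file=$1 bytes=$2 offset=${3:-0}
    local newlines
    # Count newlines in the selected byte range; wc -l counts newline characters.
    newlines=$(tail -c +"$((offset + 1))" -- "$file" | head -c "$bytes" | wc -l)
    echo $(( newlines + 1 ))
}
```

For a file containing abc, def, and 0123 on three lines, line_number_at_byte data.in 8 prints 3 and line_number_at_byte data.in 7 prints 2, matching the counts of 2 and 1 newlines described above plus one.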
