How to Find Character Locations of a String in a File

Tags: character encoding, grep, search, string

I need to search for a string (a sequence of characters) in a file with a certain encoding, typically utf8, but return the character offsets (not byte offsets) of the results.

So this is a search that should be independent of the encoding of the string/file.

grep apparently cannot do this, so which tool should I use?

Example (correct):

$ export LANG="en_US.UTF-8" 
$ echo 'aöæaæaæa' | tool -utf8 'æa'
2
4
6

Example (wrong):

$ export LANG="en_US.UTF-8"
$ echo 'aöæaæaæa' | tool 'æa'
3
6
9
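
For reference, the numbers in the wrong example are byte offsets. A minimal sketch that reproduces them, assuming GNU grep (its -o and -b options are not guaranteed everywhere) and using a hypothetical helper name, byte_offsets:

```shell
# byte_offsets is a hypothetical helper name used only for this illustration.
# Assumes GNU grep: -o prints each match on its own line, -b prefixes it with
# its byte offset, and cut keeps only that offset.
byte_offsets() {
    grep -ob -- "$1" | cut -d: -f1
}

printf 'aöæaæaæa\n' | byte_offsets 'æa'
```

Because ö and æ each take two bytes in UTF-8, every match is reported further to the right than its character position.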

Best Answer

In current versions of Perl, you can use the @- and @+ magic arrays to get the positions of the matches of the whole regex and any possible capture groups. The zeroth element of both arrays holds the indexes related to the whole substring, so $-[0] is the one you are interested in.

As a one-liner:

$ echo 'aöæaæaæa' | perl -CSDLA -ne 'BEGIN { $pattern = shift }; printf "%d\n", $-[0] while $_ =~ m/$pattern/g;'  æa
2
4
6

Or a full script:

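#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode;
use open  ":encoding(utf8)";
undef $/;                            # slurp mode: read all of stdin as one string
my $pattern = decode_utf8(shift);    # decode the pattern argument from UTF-8 bytes
binmode STDIN, ":utf8";              # decode stdin into characters
while (<STDIN>) {
    # $-[0] is the character offset at which the current match starts
    printf "%d\n", $-[0] while $_ =~ m/$pattern/g;
}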

e.g.

$ echo 'aöæaæaæa' | perl match.pl æa -
2
4
6

(The latter script only works for stdin. I seem to have trouble forcing Perl to treat all files as UTF-8.)

Related Solutions

Grep – How to Search Text Throughout Entire File System

I normally use this style of command to run grep over a number of files:

find / -xdev -type f -print0 | xargs -0 grep -H "800x600"

What this actually does is make a list of every file on the system, and then for each file, execute grep with the given arguments and the name of each file.

The -xdev argument tells find that it must ignore other filesystems - this is good for avoiding special filesystems such as /proc. However, it also ignores ordinary filesystems mounted elsewhere - so if, for example, your /home folder is on a different partition, it won't be searched - you would need to say find / /home -xdev ....

-type f means search for files only, so directories, devices and other special files are ignored (it will still recurse into directories and execute grep on the files within - it just won't execute grep on the directory itself, which wouldn't work anyway). And the -H option to grep tells it to always print the filename in its output.
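
The same search can also be written without xargs, because find can batch file names itself with -exec ... {} +. A small sketch on a throwaway directory follows. The wrapper name search_tree and the file names are made up for illustration.

```shell
# search_tree is a hypothetical wrapper name. It greps every regular file under
# the directory $1 for the pattern $2 without crossing into other filesystems.
search_tree() {
    # -exec ... {} + hands many file names to each grep run, much like xargs.
    find "$1" -xdev -type f -exec grep -H -- "$2" {} +
}

# Throwaway demo tree; the file names are purely illustrative.
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'mode 800x600\n' > "$dir/sub/a.conf"
printf 'nothing here\n' > "$dir/b.conf"
search_tree "$dir" '800x600'
rm -rf "$dir"
```

Only the file under sub/ should be reported, prefixed with its path because of -H.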

find accepts all sorts of options to filter the list of files. For example, -name '*.txt' processes only files ending in .txt. -size -2M means files that are smaller than 2 megabytes (GNU find rounds sizes up to whole units before comparing, so in practice this matches files of at most 1 MiB). -mtime -5 means files modified in the last five days. Join these together with -a for "and" and -o for "or", and use '(' parentheses ')' to group expressions (in quotes to prevent the shell from interpreting them). So for example:

find / -xdev '(' -type f -a -name '*.txt' -a -size -2M -a -mtime -5 ')' -print0 | xargs -0 grep -H "800x600"

Take a look at man find to see the full list of possible filters.

Grep and Escaping a Dollar Sign

There are two separate issues here.

1. grep uses Basic Regular Expressions (BRE), and $ is a special character in BREs only at the end of an expression. The consequence of this is that the two instances of $ in $Id$ are not equal. The first one is a normal character and the second is an anchor that matches the end of the line. To make the second $ match a literal $ you'll have to backslash escape it, i.e. $Id\$. Escaping the first $ also works: \$Id\$, and I prefer this since it looks more consistent.¹
2. There are two completely unrelated escaping/quoting mechanisms at work here: shell quoting and regex backslash quoting. The problem is that many characters that regular expressions use are special to the shell as well, and on top of that the regex escape character, the backslash, is also a shell quoting character. This is why you often see messes involving double backslashes, but I do not recommend using backslashes for shell quoting regular expressions because it is not very readable.
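
The BRE behaviour described in the first point can be checked with a couple of sample lines. This is just a sketch: count_matches is a made-up helper, and the two input lines are invented test data.

```shell
# count_matches is a hypothetical helper: it prints how many input lines match
# the BRE given as $1.
count_matches() {
    grep -c -- "$1"
}

# Two sample lines: one containing a literal $Id$, one ending in $Id.
printf '%s\n' 'x $Id$ y' 'ends with $Id' | count_matches '$Id$'     # matches only the second line
printf '%s\n' 'x $Id$ y' 'ends with $Id' | count_matches '$Id\$'    # matches only the first line
printf '%s\n' 'x $Id$ y' 'ends with $Id' | count_matches '\$Id\$'   # same as the previous pattern
```

Each command prints 1, but for different lines: the unescaped trailing $ anchors the match to the end of the line, while the escaped one matches a literal dollar sign.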

Instead, the simplest way to do this is to first put your entire regex inside single quotes as in 'regex'. The single quote is the strongest form of quoting the shell has, so as long as your regex does not contain single quotes, you no longer have to worry about shell quoting and can focus on pure BRE syntax.

So, applying this back to your original example, let's throw the correct regex (\$Id\$) inside single quotes. The following should do what you want:

grep '\$Id\$' my_dir/my_file

The reason an unquoted \$Id\$ does not work is that after shell quote removal (the step in which the shell strips its own quoting characters) is applied, the regex that grep sees is $Id$. As explained in (1.), this regex matches a literal $Id only at the end of a line because the first $ is literal while the second is a special anchor character.
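
To see what the shell actually hands to grep under each quoting style, you can print the argument instead of searching with it. A minimal sketch, with show_arg as a made-up helper name:

```shell
# show_arg is a hypothetical helper that prints its first argument exactly as
# the shell delivered it, after quote removal.
show_arg() {
    printf '%s\n' "$1"
}

show_arg '\$Id\$'   # single quotes: both backslashes survive
show_arg "\$Id\$"   # double quotes: each \$ is reduced to $
show_arg \$Id\$     # unquoted: each \$ is reduced to $
```

Only the single-quoted form delivers \$Id\$ to the command. The other two deliver $Id$.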

¹ Note also that if you ever switch to Extended Regular Expressions (ERE), e.g. if you decided to use egrep (or grep -E), the $ character is always special. In EREs $Id$ would never match anything because you can't have characters after the end of a line, so \$Id\$ would be the only way to go.
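
The ERE case from the footnote can be checked the same way. A small sketch, with count_ere_matches as a made-up helper and two invented sample lines:

```shell
# count_ere_matches is a hypothetical helper: it prints how many input lines
# match the ERE in $1. The "|| true" hides grep's non-zero exit status when
# that count is 0.
count_ere_matches() {
    grep -cE -- "$1" || true
}

printf '%s\n' 'x $Id$ y' 'ends with $Id' | count_ere_matches '$Id$'     # no line can match
printf '%s\n' 'x $Id$ y' 'ends with $Id' | count_ere_matches '\$Id\$'   # matches the first line
```

The first command prints 0 and the second prints 1.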
