How to Find Character Locations of a String in a File

character-encoding, grep, search, string

I need to search for a string (a sequence of characters) in a file with a given encoding, typically UTF-8, and get back the character offsets (not the byte offsets) of the matches.

So this is a search that should be independent of the encoding of the string/file.

grep apparently cannot do this, so which tool should I use?

Example (correct):

$ export LANG="en_US.UTF-8" 
$ echo 'aöæaæaæa' | tool -utf8 'æa'
2
4
6

Example (wrong):

$ export LANG="en_US.UTF-8"
$ echo 'aöæaæaæa' | tool 'æa'
3
6
9

Best Answer

In current versions of Perl, the special arrays @- and @+ hold the start and end offsets of the last successful match and of each of its capture groups. Element zero of each array refers to the whole match, so $-[0] (the start offset) is the one you are interested in. When the input has been decoded from UTF-8, these offsets count characters rather than bytes.
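To illustrate what the two arrays contain, here is a minimal sketch that collects the start and end character offsets of every match in a string. The helper name match_spans is hypothetical, and the sketch assumes the pattern should be treated as a literal string (hence \Q...\E) rather than as a regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # the string literals in this file are UTF-8 source text

# Hypothetical helper: returns a list of [start, end] character offsets,
# one pair per match of the literal string $pattern in $text.
sub match_spans {
    my ($text, $pattern) = @_;
    my @spans;
    while ($text =~ /\Q$pattern\E/g) {
        # $-[0] is the offset of the first character of the match,
        # $+[0] is the offset just past its last character.
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

# Prints 2-4, 4-6 and 6-8, one pair per line.
printf "%d-%d\n", @$_ for match_spans('aöæaæaæa', 'æa');
```

Because the text is held as decoded characters, the offsets count characters; the same search on the raw UTF-8 bytes would report starts of 3, 6 and 9.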

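As a one-liner (the -CSDA switches make Perl treat the standard streams, opened files, and command-line arguments as UTF-8; the L makes this conditional on the locale variables indicating UTF-8):

$ echo 'aöæaæaæa' | perl -CSDLA -ne 'BEGIN { $pattern = shift }; printf "%d\n", $-[0] while $_ =~ m/$pattern/g;' æa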
2
4
6

Or a full script:

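#!/usr/bin/perl

use strict;
use warnings;
use utf8;                       # only affects literals in this source file
use Encode qw(decode_utf8);
use open ":encoding(UTF-8)";    # default layer for files opened in this scope

# Decode the command-line pattern from UTF-8 bytes into characters.
my $pattern = decode_utf8(shift);

# Decode standard input as UTF-8 so that match offsets count characters.
binmode STDIN, ":encoding(UTF-8)";

# Slurp mode: read all of STDIN in one go, so offsets are relative to the
# start of the input (the one-liner above reports them per line instead).
undef $/;
while (<STDIN>) {
    printf "%d\n", $-[0] while $_ =~ m/$pattern/g;
}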

For example:

$ echo 'aöæaæaæa' | perl match.pl æa -
2
4
6

(The latter script only works for stdin. I seem to have trouble forcing Perl to treat all files as UTF-8.)
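For named files, one possible workaround is to open each file explicitly with a UTF-8 layer instead of relying on global defaults. The following is only a sketch under that assumption: the helper name file_offsets and the script name match_file.pl are hypothetical, and the pattern is again treated as a literal string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the character offsets at which the literal
# string $pattern occurs in the UTF-8 encoded file $path.
sub file_offsets {
    my ($path, $pattern) = @_;
    # Three-argument open with an explicit layer decodes the file as UTF-8.
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    local $/;                   # slurp mode, limited to this function
    my $text = <$fh>;
    close $fh;
    my @offsets;
    push @offsets, $-[0] while $text =~ /\Q$pattern\E/g;
    return @offsets;
}

# Assumed usage: perl match_file.pl PATTERN FILE...
if (@ARGV) {
    # Command-line arguments arrive as bytes; decode the pattern first.
    my $pattern = decode('UTF-8', shift @ARGV);
    for my $path (@ARGV) {
        print "$_\n" for file_offsets($path, $pattern);
    }
}
```

Offsets here are relative to the start of each file, as in the stdin script above.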