I need to search for a string (a sequence of characters) in a file with a certain encoding, typically UTF-8, but return the character offsets (not byte offsets) of the results.
So this is a search that should be independent of the encoding of the string/file.
grep apparently cannot do this, so which tool should I use?
Example (correct):
$ export LANG="en_US.UTF-8"
$ echo 'aöæaæaæa' | tool -utf8 'æa'
2
4
6
Example (wrong):
$ export LANG="en_US.UTF-8"
$ echo 'aöæaæaæa' | tool 'æa'
3
6
9
Best Answer
In current versions of Perl, you can use the @- and @+ magic arrays to get the positions of the matches of the whole regex and of any capture groups. The zeroth element of each array refers to the whole matched substring ($-[0] is its start offset, $+[0] its end offset), so $-[0] is the one you are interested in. These offsets count characters rather than bytes as long as Perl has decoded the input (for example as UTF-8).
As a one-liner:
Or a full script:
e.g.
(The latter script only works for stdin. I seem to have trouble forcing Perl to treat all files as UTF-8.)