How to do a regex search in a UTF-16LE file while in a UTF-8 locale

grep, perl, regular expression, text processing, unicode

EDIT: A comment by Warren Young made me realize that I was not clear on one quite relevant point. My search string is already in UTF-16LE order (not in Unicode codepoint order, which is UTF-16BE), so perhaps the Unicode issue is somewhat moot.

Perhaps my issue is really a question of how to grep for bytes (not chars) in groups of 2 bytes, i.e. so that UTF-16LE \x09\x0A is not treated as TAB, newline, but just as 2 bytes which happen to be UTF-16LE. Note: I do not need to be concerned about UTF-16 surrogate pairs, so 2-byte blocks are fine.
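To illustrate the "bytes in groups of 2" idea, here is a minimal Python sketch (Python is my choice here, not something from the original post). It runs a byte-level regex over the raw data and keeps only matches that start on an even offset, so \x09 and \x0A are never interpreted as TAB or newline. The function name find_aligned and the sample data are hypothetical, and it assumes the data has no BOM and no surrogate pairs.

```python
import re

def find_aligned(data: bytes, pattern: bytes):
    """Yield offsets of regex matches that start on a 2-byte boundary."""
    rx = re.compile(pattern, re.DOTALL)  # DOTALL: '.' also matches byte 0x0A
    pos = 0
    while pos <= len(data):
        m = rx.search(data, pos)
        if m is None:
            return
        if m.start() % 2 == 0:  # keep only matches aligned to a code unit
            yield m.start()
        pos = m.start() + 1     # resume just past this match start

# Demo on in-memory bytes; a real file would be read with open(path, "rb").
sample = "कऊपर\n".encode("utf-16-le")
needle = rb"\x0a\x09\x2a\x09\x30\x09"  # ऊपर written out as UTF-16LE bytes
print(list(find_aligned(sample, needle)))
```

The alignment check matters because a byte pair such as 0A 09 can also straddle two neighbouring code units, which would be a false hit.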

Here is a sample pattern for this 3-character string ऊपर:

  • \x09\x0A\x09\x2A\x09\x30

    but it returns nothing, though the string is in the file.
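One possible explanation for the empty result, assuming the file really is UTF-16LE: the bytes in the pattern above are in big-endian order. A quick way to check is to encode the string both ways and compare, as in this small Python sketch (the variable names are just for illustration).

```python
text = "ऊपर"  # U+090A U+092A U+0930

# Encode the same string in both byte orders and show the bytes as hex.
le_hex = " ".join(f"{b:02x}" for b in text.encode("utf-16-le"))
be_hex = " ".join(f"{b:02x}" for b in text.encode("utf-16-be"))

print("UTF-16LE:", le_hex)  # 0a 09 2a 09 30 09
print("UTF-16BE:", be_hex)  # 09 0a 09 2a 09 30
```

If that is the cause, a byte pattern for a little-endian file would need each pair swapped, i.e. \x0A\x09\x2A\x09\x30\x09.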

(here is the original post)
When searching a UTF-16LE file with a pattern in \x00\x01\x...etc format, I have encountered problems for some values. I've been using sed (and experimented with grep), but being in the UTF-8 locale they recognize some UTF-16LE values as ASCII characters. I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option.

e.g. the character ऊ (Unicode U+090A), though it is a single character, is perceived as two ASCII chars, \x09 and \x0A.

grep has a -P (perl) option which can search for \x00\x... patterns, but I'm getting the same ASCII interpretation.

Is there some way to use grep -P to search in a UTF-16 mode, or perhaps better, how can this be done in perl or some other script?

grep seems to be the most appealing because of its compactness, but whatever gets the job done will override that preference.

PS: My example uses a literal string, but my actual usage needs a regex-style search. So this perl example is not quite what I'm after, though it does process the file as UTF-16… I'd prefer not to have to open and close the file… I think perl has more compact ways for basic things like a regex search. I'm after something with that type of compact syntax.

Best Answer

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
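For readers who would rather stay in a script than in a pipeline, the same idea (decode first, then search with ordinary character-based regexes) can be sketched in Python. This is only an illustration of the approach, not a replacement for the command above; grep_utf16le is a hypothetical helper name, and it assumes the file has no BOM (with "utf-16-le" a leading BOM would show up as U+FEFF at the start of the first line).

```python
import re

def grep_utf16le(path, pattern):
    """Yield lines of a UTF-16LE file that match a regular expression."""
    rx = re.compile(pattern)
    # Text mode decodes to str; the default newline handling also
    # turns \r\n line endings into \n while reading.
    with open(path, encoding="utf-16-le") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if rx.search(line):
                yield line
```

A call such as list(grep_utf16le("myfile.txt", r"ऊप.")) would then return the matching lines as ordinary strings, with the pattern written in characters rather than \x.. byte escapes.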
