How to do a regex search in a UTF-16LE file while in a UTF-8 locale

grep, perl, regular expression, text processing, unicode

EDIT: A comment from Warren Young made me realize that I was not clear on one quite relevant point. My search string is already in UTF-16LE byte order (not in Unicode code point order, which corresponds to UTF-16BE), so perhaps the Unicode issue is somewhat moot.

Perhaps my question is really how to grep for bytes (not chars) in groups of 2 bytes, i.e. so that UTF-16LE \x09\x0A is not treated as TAB, newline, but just as 2 bytes which happen to be UTF-16LE ऊ? … Note: I do not need to be concerned about UTF-16 surrogate pairs, so 2-byte blocks are fine.

Here is a sample pattern for this 3-character string ऊपर:

\x09\x0A\x09\x2A\x09\x30

but it returns nothing, though the string is in the file.

(here is the original post)
When searching a UTF-16LE file with a pattern in \x00\x01\x...etc format, I have encountered problems for some values. I've been using sed (and experimented with grep), but being in the UTF-8 locale they recognize some UTF-16LE values as ASCII characters. I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option.

E.g. in this text, ऊ (U+090A), though it is a single character, is perceived as two ASCII chars, \x09 and \x0A.

grep has a -P (perl) option which can search for \x00\x... patterns, but I'm getting the same ASCII interpretation.

Is there some way to use grep -P to search in a UTF-16 mode, or perhaps better, how can this be done in perl or some other script?

grep seems to be the most appealing because of its compactness, but whatever gets the job done will override that preference.

PS: My ऊ example uses a literal string, but my actual usage needs a regex-style search. So this perl example is not quite what I'm after, though it does process the file as UTF-16… I'd prefer not to have to open and close the file… I think perl has more compact ways of doing basic things like a regex search. I'm after something with that type of compact syntax.

Best Answer

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
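
As a rough sketch of that pipeline with an actual regex: the snippet below builds a two-line UTF-16LE sample from raw bytes, converts it with iconv, strips carriage returns, and counts the matching lines. The `sample` function and its contents are made up for illustration. It assumes a BOM-less file with CRLF line endings and an iconv that accepts the encoding name UTF-16LE.

```shell
# Hypothetical sample data: two UTF-16LE lines with CRLF endings, no BOM.
sample() {
  # U+090A U+092A U+0930 (the string from the question), then CR LF
  printf '\012\011\052\011\060\011\015\000\012\000'
  # "abc", then CR LF
  printf 'a\000b\000c\000\015\000\012\000'
}

# Convert to UTF-8, drop the carriage returns, then use an ordinary regex.
matches=$(sample | iconv -f UTF-16LE -t UTF-8 | tr -d '\r' | grep -c -E '^ऊ.*र$')
printf '%s\n' "$matches"
```

Dropping -c makes grep print the matching lines themselves, in UTF-8. The file on disk is never rewritten, so it stays in UTF-16.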

How to quote special characters (portably)

The following snippet adds a backslash before each character that's special in extended regular expressions, using sed to replace any occurrence of one of the characters ][()\.^$?*+ by a backslash followed by that character:

raw_string='test[string]\.wibble'
quoted_string=$(printf %s "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')

This will remove trailing newlines in $raw_string; if that's a problem, ensure that the string doesn't end with a newline by adding an inert character at the end, then strip off that character.

quoted_string=$(printf %sa "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')
quoted_string=${quoted_string%?}
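
To illustrate what the quoting buys you, here is a small sketch that feeds the quoted string to grep -E. The two input lines are invented for the example. The second one would match if the raw string were used as a pattern unquoted, because [string] would then be read as a bracket expression.

```shell
raw_string='test[string]\.wibble'
quoted_string=$(printf %s "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')

# Count the input lines that contain the raw string literally.
# Only the first line does.
matches=$(printf '%s\n' 'test[string]\.wibble' 'tests.wibble' |
  grep -E -c -- "$quoted_string")

printf '%s\n' "$quoted_string" "$matches"
```

The character list does not include | or {, which are also special in extended regular expressions, so a raw string containing either would need them added to the bracket expression.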

How to quote special characters (in bash or zsh)

Bash and zsh have a pattern replacement feature, which can be faster if the string is not very long. It's cumbersome here because the replacement must be a string, so each character needs to be replaced separately. Note that you must escape the backslashes first.

quoted_string=${raw_string//\\/\\\\}
for c in \[ \] \( \) \. \^ \$ \? \* \+; do
  quoted_string=${quoted_string//"$c"/"\\$c"}
done

How to quote special characters (in ksh93)

Ksh's string replacement construct is more powerful than the watered-down version in bash and zsh. It supports references to groups in the pattern.

quoted_string=${raw_string//@([][()\.^$?*+])/\\\1}

What you actually want

You don't need find here: shell patterns are sufficient to match files ending with three digits. If no part file exists, the glob pattern is left unexpanded. There's also a simpler way of adding the file sizes: rather than use stat (which exists on many unix variants but has a different syntax on each) and do complex pipelining to sum the values, you can call wc -c (on regular files, on most systems, wc will look at the file size and not bother to open the file and read the bytes).

set -- "$DESTINATION/$FILE_BASENAME".[0-9][0-9][0-9]
case $1 in
  *\]) # The glob was left intact, so no part exists
    do_split …;;
  *) # The glob was expanded, so at least one part exists
    # Take the leading number on wc's last line (the total, or the only part's size)
    FILE_SIZE_EXISTING=$(wc -c "$@" | sed -n '$s/^[^0-9]*\([0-9]*\).*/\1/p')
    if [ "$FILE_SIZE_EXISTING" -ne "$(wc -c <"$DESTINATION/$FILE_BASENAME")" ]; then
      do_split …
    fi;;
esac

Note that your test on the total size is not very reliable: if the file has changed but remained the same size, you'll end up with stale parts. That's ok if the files never change and the only risk is that parts may be truncated or missing.
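
If stale parts are a concern, one option (not part of the original script) is to compare content instead of size. The sketch below uses the POSIX cksum utility, which prints a CRC and a byte count. `parts_match` is a hypothetical helper name. Unlike the wc -c approach, it has to read every byte, and a CRC is a sanity check, not a cryptographic guarantee.

```shell
# Hypothetical helper: succeeds if the parts, concatenated in the order
# given, have the same CRC and length as the original file.
# Usage: parts_match ORIGINAL PART...
parts_match() {
  original=$1
  shift
  # Both sides read from stdin, so cksum prints no file name
  # and the two outputs can be compared as plain strings.
  [ "$(cat "$@" | cksum)" = "$(cksum <"$original")" ]
}
```

With the positional parameters set by the glob as above, the size comparison could then be replaced by a call such as parts_match "$DESTINATION/$FILE_BASENAME" "$@".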
