Sed: Match character range

Tags: sed, unicode

Is there a way to match some Unicode range exactly?
Let's use the Cyrillic range as an example: U+0400 to U+052F.

The whole range of characters can be printed (from bash or zsh) with:


$ echo -e $(printf '\\U%x' $(seq 0x400 0x52f))
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҇ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯ

$ a=$(zsh -c 'echo -e $(printf '\''\\U%x'\'' $(seq 0x400 0x52f))')

To filter out a sub-range of it, let's use 0x452 to 0x490; this is the expected output:

$ b=$(bash -c 'echo -e $(printf '\''\\U%x'\'' $(seq 0x452 0x490))')
$ echo "$b"
ђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҇ҊҋҌҍҎҏҐ
$ echo "$b" | xxd
00000000: d192 d193 d194 d195 d196 d197 d198 d199  ................
00000010: d19a d19b d19c d19d d19e d19f d1a0 d1a1  ................
00000020: d1a2 d1a3 d1a4 d1a5 d1a6 d1a7 d1a8 d1a9  ................
00000030: d1aa d1ab d1ac d1ad d1ae d1af d1b0 d1b1  ................
00000040: d1b2 d1b3 d1b4 d1b5 d1b6 d1b7 d1b8 d1b9  ................
00000050: d1ba d1bb d1bc d1bd d1be d1bf d280 d281  ................
00000060: d282 d283 d284 d285 d286 d287 d288 d289  ................
00000070: d28a d28b d28c d28d d28e d28f d290 0a    ...............

But it seems impossible to filter with sed. This doesn't work:

$ echo "$a" | sed 's/[^\x452-\x490]//g'

Nor does this (the result matches other characters, probably a collating issue):

$ echo "$a" | sed $'s/[^\u452-\u490]//g'
АБВГжзийклмнопрстуфхцчшщъыьэюяёђєѕіїјљњћќѝўџҋҍҏҐҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎӐӒӔӝӟӡӣӥӧөӫӭӯӱӳӵӹԅԇԉԋԍԏ

Not even this (same collating issue):

$ echo "$a" | sed 's/[^ђ-Ґ]//g'

This works with awk:

$ echo "$a" | awk '{gsub(/[^ђ-Ґ]/,"")}1'

But the only way to use a hex range is to have the shell convert the hex values to Unicode characters first:

$ echo "$a" | awk $'{gsub(/[^\u452-\u490]/,"")}1'

or (two solutions):

$ c=$(bash -c 'printf "\u452-\u490"') 
$ echo "$a" | awk '{gsub(/[^'"$c"']/,"")}1'
$ echo $a | awk -v ra="[^$c]" '{gsub(ra,"")}1'

Questions:

  • Is there a way to do this with sed?
  • Could awk do it with hex values directly, without help from an outer shell?
  • If possible, what exactly is the range matched by the collating sequence that sed uses with sed 's/[^ђ-Ґ]//g'?

P.S.: I know this could be done in perl, thanks.

Best Answer

Per POSIX, ranges in bracket expressions are only specified to be based on code point in the C/POSIX locale. In other locales, it's unspecified and is often somewhat based on the collation order as you found out. You'll find that in some locales, depending on the tool, [g-j] for instance includes i but also ı, ǵ, sometimes even I or even ch like in some Czech locales.
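
As a quick illustration (a sketch: whether the second command deletes anything depends on your libc and its collation data, and en_US.UTF-8 is just an example locale):

$ echo ı | LC_ALL=C sed 's/[g-j]//g'              # byte/code-point range: ı untouched
ı
$ echo ı | LC_ALL=en_US.UTF-8 sed 's/[g-j]//g'    # collation-based range: may delete ı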

zsh is one of those rare ones whose [x-y] ranges are based on code point regardless of the locale. For single-byte character sets, that will be based on byte value; for multi-byte ones, on Unicode code point or whatever the system uses to represent wide characters internally with the mbstowcs() and co. APIs (generally Unicode).

So in zsh,

  • [[ $char = [$'\u452'-$'\u490'] ]]
  • [[ $char = [ђ-Ґ] ]]
  • y=${x//[^ђ-Ґ]/}

would work in your case to match on characters in that Unicode range provided the locale's charset is multi-byte and has those two characters. There are single-byte charsets that contain some of those characters (like ISO8859-5 that has most of the ones in U+0401 .. U+045F), but in locales that use those, the [ђ-Ґ] ranges would be based on the byte value (code point in the corresponding charset, not Unicode codepoint).
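
For instance (a minimal sketch in zsh with a UTF-8 locale; zsh's printf understands \uNNNN escapes):

$ x=$(printf '\u450\u452\u460\u490\u4a0')   # ѐђѠҐҠ
$ print -r -- ${x//[^ђ-Ґ]/}                 # only U+0452 .. U+0490 survive
ђѠҐ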

In the C locale, ranges are based on code point, but the charset in the C locale is only guaranteed to include the characters in the portable character set which is just the few characters necessary to write POSIX or C code (none of which is in the Cyrillic script). It is also guaranteed to be single-byte so cannot possibly include all the characters specified in Unicode. In practice, it is most often ASCII.

In practice you cannot set LC_COLLATE to C without also setting LC_CTYPE to C (or at least a locale with a single-byte charset). Many systems however have a C.UTF-8 locale which you could use here.
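
You can check whether your system has it with something like this (on some systems the name appears as C.utf8 instead):

$ locale -a | grep -i '^c\.utf'
C.UTF-8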

UTF-8 is one of those character sets that can represent all the Unicode characters, and so all those in any charset. So you could do:

< file iconv -t utf-8 |
  LC_ALL=C.UTF-8 sh -c 'sed "$(printf "s/[^\321\222-\322\220]//g")"' |
  iconv -f utf-8

The first iconv converts from the user's locale charset to UTF-8 (\321\222 and \322\220 being the UTF-8 encodings of U+0452 and U+0490 respectively); the second iconv converts back to the locale's charset.
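
If you need to work out those octal byte values for another range, a shell whose printf supports \u escapes (bash 4.2 or newer, zsh) together with od can show them:

$ printf '\u452\u490' | od -An -to1
 321 222 322 220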

If the current locale already uses UTF-8 as the charset (and file is written using that charset), that can be simplified to:

<file LC_ALL=C.UTF-8 sed 's/[^ђ-Ґ]//g'

or:

<file LC_ALL=C.UTF-8 sed "$(printf "s/[^\321\222-\322\220]//g")"

With GNU sed and provided $POSIXLY_CORRECT is not in the environment, you can specify characters based on the value of bytes of their encoding.

<file LC_ALL=C.UTF-8 sed 's/[^\321\222-\322\220]//g'

Though in older versions you may need:

<file LC_ALL=C.UTF-8 sed 's/[^\o321\o222-\o322\o220]//g'

Or the hexadecimal variant:

<file LC_ALL=C.UTF-8 sed 's/[^\xd1\x92-\xd2\x90]//g'
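
With the $a and $b variables from the question, you can check that this reproduces the expected output (assuming GNU sed and an available C.UTF-8 locale):

$ [ "$(echo "$a" | LC_ALL=C.UTF-8 sed 's/[^\xd1\x92-\xd2\x90]//g')" = "$b" ] && echo OK
OK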

Another option, for locales using a multi-byte character set that includes those characters, on systems where the wide-character representation is based on Unicode, is to use GNU awk:

awk 'BEGIN{for (i = 0x452; i<=0x490; i++) range = range sprintf("%c", i)}
     {gsub("[^" range "]", ""); print}'

(Initially, I believed POSIX required awk implementations to behave like GNU awk, but that's not the case: POSIX leaves the behaviour of sprintf("%c", i) undefined for values of i that don't correspond to the encoding (not the code point) of a character in the locale, which means it can't be used portably for multi-byte characters.)

In any case, note that the characters in the U+0400 .. U+052F range are not the only Unicode characters in the Cyrillic script, let alone the only ones used by languages written in Cyrillic. The list of characters also varies with the version of Unicode.

On a Debian-like system, you can get a list of them with:

unicode --max 0 cyrillic

(which gives 435 different ones on Ubuntu 16.04 and 444 on Debian sid, probably because they use different versions of Unicode).

In perl, see \p{Block: Cyrillic}, \p{Block: Cyrillic_Ext_A,B,C}, \p{Block: Cyrillic_Supplement}... to match on Unicode blocks, and \p{Cyrillic} to match characters of the Cyrillic script currently assigned in the Unicode version your perl uses (see perl -MUnicode::UCD -le 'print Unicode::UCD::UnicodeVersion' for instance).

So:

perl -Mopen=locale -pe 's/\P{Cyrillic}//g'
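
For example (a quick check; -l is added to keep the line ending, since the newline itself is not a Cyrillic character):

$ echo 'abcЀЁxyz' | perl -Mopen=locale -lpe 's/\P{Cyrillic}//g'
ЀЁ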