What does [[.ch.]] mean in a regex

posix, regular-expression, terminology

Alternate title: What is a "collating sequence" or "collating element" in a POSIX-compliant regex?

I found the exact technical definition in Section 9.3.5 of the POSIX specs, as item #4 in the list, but it's not really clear to me.

I googled around on the web for examples and explanations and came up not completely empty-handed, but definitely not enlightened.

The only thing I've sort of gotten is that in certain circumstances, you can make your regex treat multiple characters as though they were a single character for purposes of length comparison and determining what the "longest match" is (since regexes are greedy and return the longest possible match).

Is that all, though? I'm having trouble seeing a use for it, but I suspect my understanding is incomplete. What actually is "collating" for a regex? And how does [[.ch.]], the example in the POSIX specs, relate to this?

Best Answer

Collation elements are usually referenced in the context of sorting.

In many languages, collation (sorting as in a dictionary) is not done purely character by character. For instance, in Czech, ch doesn't sort between cg and ci as it would in English; it is treated as a single unit for sorting. It is a collating element (we can't call it a character here; characters are a subset of collating elements) that sorts between h and i.
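To see that sorting rule in action, here is a small sketch using sort. The word list and the sort_in_locale helper are made up for illustration, and the Czech run is only attempted if a cs_CZ UTF-8 locale appears to be installed:

```shell
# Hypothetical word list, chosen so that "ch" matters for the order.
words='cihla
chata
had
idea'

# Sort the list under a given locale (function name is illustrative).
sort_in_locale() {
    printf '%s\n' "$words" | LC_ALL="$1" sort
}

# Byte-wise order: "chata" lands before "cihla" because h < i.
sort_in_locale C

# Czech order, only attempted if the locale appears to be installed.
if locale -a 2>/dev/null | grep -qiE '^cs_CZ\.utf-?8$'; then
    sort_in_locale cs_CZ.UTF-8
fi
```

In the C locale the order is chata, cihla, had, idea. If the Czech locale's collation data follows the rule described above, chata should instead come after had, since ch sorts between h and i.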

Now you may ask: what does that have to do with regular expressions? Why would I want to refer to a collating element in a bracket expression?

Well, inside bracket expressions, one does use order. For instance in [c-j], you want the characters in between c and j. Well, do you? You'd rather want collating elements there. [h-i] in a Czech locale matches ch:

$ echo cho | LC_ALL=cs_CZ.UTF-8 grep '^[h-i]o'
cho
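For contrast, the same pattern can be tried in the C locale, where a range only covers single characters. This is a minimal sketch; the count_cho_matches helper is a name made up here, not something from grep or POSIX:

```shell
# Count lines of "cho" matched by ^[h-i]o under a given locale
# (helper name is illustrative; grep -c prints the count, and
# "|| true" hides grep's non-zero exit status when nothing matches).
count_cho_matches() {
    printf 'cho\n' | LC_ALL="$1" grep -c '^[h-i]o' || true
}

count_cho_matches C    # prints 0: in the C locale [h-i] is just h or i
```

Calling count_cho_matches cs_CZ.UTF-8 on a system like the one used above would be expected to report one matching line instead, but that depends on the locale being installed and on the grep implementation.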

So, if you can specify a range of collating elements in a bracket expression, you'd expect to be able to list them individually as well. But [a-cch] would match the collating elements between a and c plus the individual characters c and h. To get a-c and the ch collating element, we need a new syntax:

$ echo cho | LC_ALL=cs_CZ.UTF-8 grep '^[a-c[.ch.]]o'
cho

(that is, the collating elements between a and c, plus the ch one).

Now, the world is not perfect yet and probably never will be. The example above was run on a GNU system, where it worked. Another example of a collating element could be e followed by a combining acute accent in UTF-8 ($'e\u0301', rendered the same as $'\u00e9', that is, é).

Both forms display as é and represent the same letter, except that one is encoded as a single character (U+00E9) and the other as two (e followed by U+0301).

$ echo $'e\u301t\ue9' | grep '^[d-f]t'

This will work properly on some systems but not others (not on GNU ones, for instance). And it's unclear whether $'[[.\ue9.]]' should match only $'\ue9' or both $'\ue9' and $'e\u301'.
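The two spellings of é are indistinguishable on screen, but they can be told apart at the byte level. A small sketch, using octal escapes for the UTF-8 bytes so that it does not rely on $'\u...' support; the two function names are invented for this example:

```shell
# Byte length of each spelling of é, written with octal escapes so the
# snippet does not depend on the shell supporting $'\u...'.
# (Function names are illustrative.)
precomposed_len() { printf '\303\251' | wc -c | tr -d ' '; }      # U+00E9
decomposed_len()  { printf 'e\314\201' | wc -c | tr -d ' '; }     # e + U+0301

precomposed_len   # prints 2
decomposed_len    # prints 3
```

One code point takes two bytes, while the e plus combining accent pair takes three, so a regex engine has to decide for itself whether to treat the pair as one collating element.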

Not to mention non-alphabetic scripts, scripts with different regional sorting orders, or things like ﬃ (ffi in one character), which become tricky to handle with such a simple API.
