Why Searching Through UTF-16 Encoded Text Files Is Difficult


I understand that, e.g., catfish and gnome-search-utils can both search inside file contents that are UTF-8 encoded. To search for words or numbers inside UTF-16 encoded text files, one would first have to convert them to UTF-8 with iconv.
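For illustration (a minimal C sketch I put together, not taken from either of those tools): a plain byte-for-byte search for the UTF-8 bytes of a word never matches the UTF-16 encoding of the same word, because of the interleaved zero bytes.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* The word "cat" as plain ASCII/UTF-8 bytes... */
    const char needle[] = "cat";
    /* ...and the same word encoded as UTF-16LE: every letter is
       followed by a zero byte. */
    const char haystack[] = { 'c', 0, 'a', 0, 't', 0 };

    /* A byte-for-byte scan, as a UTF-8-only search tool would do,
       never finds the needle because of the interleaved zero bytes. */
    int found = 0;
    for (size_t i = 0; i + strlen(needle) <= sizeof haystack; i++) {
        if (memcmp(haystack + i, needle, strlen(needle)) == 0) {
            found = 1;
            break;
        }
    }
    printf("match: %s\n", found ? "yes" : "no");  /* prints "no" */
    return 0;
}
```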

If the file is known, text editors like gedit or mousepad have no trouble with UTF-16.

Why does no Linux distribution ship a search tool (GUI or command-line) that can handle UTF-16 encoded text files?

I'm on Xubuntu.

Best Answer

UTF-16 (or UCS-2) is highly unfriendly towards the NUL-terminated strings used by the C standard library and the POSIX ABI. For example, command-line arguments are terminated by NULs (bytes with the value zero), and any UTF-16 code unit with a numerical value below 256 contains a zero byte, so a string of ordinary English letters is impossible to pass as a command-line argument in UTF-16.
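A rough demonstration (assuming UTF-16LE byte order; this is only a sketch, not code from any real tool): to any NUL-terminated interface, a UTF-16 string of English letters effectively ends after its first byte.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "hello" encoded as UTF-16LE, plus a 16-bit terminator. */
    const char utf16_hello[] = { 'h', 0, 'e', 0, 'l', 0, 'l', 0, 'o', 0, 0, 0 };

    /* Any NUL-terminated interface (argv, strlen, strcpy, ...) stops at
       the first zero byte, i.e. right after the 'h'. */
    printf("bytes stored : %zu\n", sizeof utf16_hello);   /* 12 */
    printf("strlen() sees: %zu\n", strlen(utf16_hello));  /* 1  */
    return 0;
}
```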

That in turn means that such utilities would either need to take their input in some other encoding (say UTF-8) and convert it to UTF-16 internally, or take their input by some means other than command-line arguments. The first option would require every such utility to contain (or link against) conversion code, and the second would make interfacing those programs with other utilities somewhat difficult.
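As a sketch of what the first option would involve (using the POSIX iconv(3) interface; the hard-coded pattern and the UTF-16LE byte order are assumptions for illustration, not how any particular tool works):

```c
#include <stdio.h>
#include <iconv.h>

int main(void) {
    /* A search pattern received as ordinary UTF-8, e.g. from argv. */
    char utf8_in[] = "hi";
    char utf16_out[16];

    char *in = utf8_in;
    char *out = utf16_out;
    size_t inleft = sizeof utf8_in - 1;   /* exclude the trailing NUL */
    size_t outleft = sizeof utf16_out;

    /* Open a converter from UTF-8 to UTF-16LE and run the pattern through it. */
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t) -1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    iconv_close(cd);

    /* The converted pattern could now be matched byte-for-byte against
       UTF-16LE file contents (with memcmp rather than strstr, because
       of the embedded zero bytes). */
    size_t produced = sizeof utf16_out - outleft;
    printf("UTF-16LE pattern is %zu bytes:", produced);
    for (size_t i = 0; i < produced; i++)
        printf(" %02x", (unsigned char) utf16_out[i]);
    printf("\n");   /* prints: 68 00 69 00 */
    return 0;
}
```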

Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I would guess that few people care enough about UTF-16 to be motivated to write tools for it.
