docx2txt works on the information in the .docx file, which is a zipped set of XML files.
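You can see this for yourself with unzip (report.docx is just a placeholder name; word/document.xml is the part that holds the body text):

< /dev/null unzip -l report.docx                    # list the XML parts in the archive
< /dev/null unzip -p report.docx word/document.xml | head -c 300   # peek at the raw XML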
With regard to line wrapping: the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the 80 columns are configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then exporting to .txt via Open/LibreOffice is also unlikely to produce the same layout, although it tries to do a good job¹.
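For reference, the basic invocations look roughly like this (report.docx is a placeholder; txt:Text is the plain-text output filter that current LibreOffice builds accept):

docx2txt report.docx report.txt
libreoffice --headless --convert-to txt:Text report.docx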
So docx2txt, or any other command-line utility, including command-line-driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout that exporting from Word gives you².
If you want (or are forced by client requirements) to render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and getting incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the Linux host server. A program on the VM watches for incoming files to be converted on the host, starts and drives Word to do the conversion, and then copies back the result⁴.
Decisions about using CR/LF or LF only, and about UTF-8 versus some other encoding for the .txt files, largely depend on how the resulting files are used. If the resulting files are used on Windows I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8, and will not barf on the BOM and/or will use that information. You should test all your target applications for compatibility if those are known up front.
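A minimal sketch of that post-processing, with out.txt as a placeholder name (the \r escape in the replacement assumes GNU sed):

printf '\357\273\277' > out-windows.txt     # UTF-8 BOM, bytes EF BB BF
sed 's/$/\r/' out.txt >> out-windows.txt    # append the text with LF -> CR/LF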
¹ This sort of incompatibility is the primary reason some of my friends cannot switch from Windows to Linux, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx files.
⁴ The program uses GUI automation, as if someone were clicking its menus, and doesn't attempt to drive Word via an API. I am pretty sure the latter can be done as well and would have the advantage of not breaking things if Word were upgraded.
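For the record, a minimal sketch of that API route in Perl with Win32::OLE, run on the Windows side (untested here; 17 is wdFormatPDF in the Word object model, 2 would be wdFormatText):

use strict;
use warnings;
use Win32::OLE;

my ($in, $out) = @ARGV;
my $word = Win32::OLE->new('Word.Application')
    or die "cannot start Word: " . Win32::OLE->LastError() . "\n";
$word->{Visible} = 0;                   # no need to show the Word window
my $doc = $word->Documents->Open($in);
$doc->SaveAs($out, 17);                 # 17 = wdFormatPDF
$doc->Close(0);                         # 0 = do not save changes
$word->Quit;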
If you want UTF-8 encodings of code points 0 to 0x7FFFFFFF (which the UTF-8 encoding algorithm was originally designed to work on):
< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \4}        # read 4 bytes at a time
  no warnings "utf8";   # suppress warnings about surrogates and huge code points
  # 32-bit big-endian value with the sign bit cleared: 0..0x7FFFFFFF
  print chr(unpack("L>", $_) & 0x7fffffff)'
Nowadays, Unicode is restricted to the ranges 0..D7FF and E000..10FFFF (though some of those code points are not assigned, and some never will be, as they are defined as non-characters).
< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}        # read 3 bytes at a time
  no warnings "utf8";
  # scale the 24-bit value into 0..0x10F7FF; 0x10F800 is the number of
  # Unicode scalar values (0x110000 code points minus 0x800 surrogates)
  $c = unpack("L>", "\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;   # jump over the surrogate block D800..DFFF
  print chr($c)'
If you only want assigned characters, you can pipe that to:
uconv -x '[:unassigned:]>;'
Or change that to:
< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}
  no warnings "utf8";
  $c = unpack("L>", "\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  $c = chr $c;
  print $c if $c =~ /\P{unassigned}/'   # keep only assigned code points
You may prefer:
if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/
to get only graphical and spacing characters (excluding those from the private-use areas).
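Spelled out in full, that variant becomes:

< /dev/urandom perl -CO -ne '
  BEGIN{$/ = \3}
  no warnings "utf8";
  $c = unpack("L>", "\0$_") * 0x10f800 >> 24;
  $c += 0x800 if $c >= 0xd800;
  $c = chr $c;
  print $c if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/'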
Now, to get 1 GiB of that, you can pipe it to head -c1G (assuming GNU head), but beware that the last character may be cut in the middle.
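If that possibly cut-off final character matters, one way to trim it is to revalidate the stream on the way out (assuming GNU iconv, whose -c option silently drops invalid sequences; the file then ends up a few bytes short of 1 GiB):

... | head -c1G | iconv -f UTF-8 -t UTF-8 -c > random-1G.txt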
UTF-16 (or UCS-2) is highly unfriendly to the null-terminated strings used by the C standard library and the POSIX ABI. For example, command-line arguments are terminated by NULs (bytes with value zero), and the UTF-16 encoding of any character with code point < 256 contains a zero byte, so a string of the usual English letters would be impossible to pass as a command-line argument in UTF-16.
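A quick way to see the problem (exact od spacing may differ):

$ printf ls | iconv -t UTF-16LE | od -An -tx1
 6c 00 73 00

Each letter carries a zero byte, so a NUL-terminated reading of that string would end after the very first byte.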
That in turn means that either the utilities would need to take input in some other format (say UTF-8) and convert to UTF-16; or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs to other utilities somewhat difficult.
Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I'd guess that few people care enough about UTF-16 to be motivated to create tools for it.