How to view a CP1251 text file in a UTF-8 console

Tags: character encoding, console, less, text

Attempt 1:

$ less subs.srt
"subs.srt" may be a binary file.  See it anyway? 
<C8><F2><E0><EB><FC><FF><ED> ...

Attempt 2:

$ LANG=ru_RU.CP1251 less subs.srt
����� �����, ��� ������.
��� ������� �������������! ...

Workaround:

$ iconv -f cp1251 < subs.srt | less

How do I do it conveniently?

Best Answer

To make less run in a different encoding from the terminal's, use luit (which ships with the X11 utility suite).

$ LANG=ru_RU.CP1251 luit less subs.srt

If you want to detect the encoding automatically, that's trickier, because a text file carries no indication of its encoding. Enca tries to recognize the encoding of a file, given the language it is written in:

$ enca -L russian subs.srt
MS-Windows code page 1251
$ iconv -f "$(enca -iL russian subs.srt)" subs.srt | less

You can make this combination a LESSOPEN filter (see "How can I view gzipped files in less without having to type zless?" for an example). However, that may not give good results for text that isn't actually in Russian.
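As a rough sketch of such a filter, the script below wraps the enca and iconv calls in a function and prints the converted text, or nothing when detection or conversion fails; less shows the original file when a LESSOPEN pipe produces no output. The function name enca_filter and the install path in the note after the script are made up for illustration, and the Russian language hint is carried over from the example above.

```shell
#!/bin/sh
# Sketch of a LESSOPEN input filter; enca_filter is a hypothetical name.
# Prints the file converted to UTF-8, or nothing if detection or
# conversion fails, in which case less displays the original file.
enca_filter() {
  # Ask enca for the iconv name of the detected encoding (Russian assumed).
  enc=$(enca -iL russian "$1" 2>/dev/null) || return 0
  [ -n "$enc" ] || return 0
  # An encoding name that iconv does not know makes iconv fail without output.
  iconv -f "$enc" -t UTF-8 < "$1" 2>/dev/null || return 0
}

# less passes the file name as the first argument.
if [ "$#" -gt 0 ]; then
  enca_filter "$1"
fi
```

Saved as, say, $HOME/bin/lessenca (a hypothetical path) and made executable, it could be enabled with export LESSOPEN="|$HOME/bin/lessenca %s". less accepts exactly one %s in LESSOPEN, which is why the two commands go into a script rather than straight into the variable.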

If you only use UTF-8 and CP1251, you can fall back to CP1251 whenever a file isn't valid UTF-8. UTF-8 has “holes”: many byte sequences are not valid, so most files in an 8-bit encoding fail to decode as UTF-8. Proof-of-concept filter script for LESSOPEN, reading the file on standard input (might not work on systems other than Linux, because it relies on head -c N reading exactly N bytes):

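#!/bin/sh
# Read the first 1000 bytes of standard input as a sample.
head=$(head -c 1000)
# In a UTF-8 locale, "." matches only validly encoded characters, so
# grep -v selects lines containing a byte sequence that is not valid UTF-8.
# Caveat: the sample may end in the middle of a multibyte character.
if printf '%s\n' "$head" | grep -qav '^.*$'; then
  # Not valid UTF-8: assume CP1251 and convert the whole stream.
  { printf '%s\n' "$head"; cat; } | iconv -f CP1251
else
  # Valid UTF-8 (or plain ASCII): pass the stream through unchanged.
  { printf '%s\n' "$head"; cat; }
fi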
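An alternative sketch that avoids head -c: let iconv validate the whole file, since a UTF-8 to UTF-8 conversion fails on any invalid byte sequence. This is a swapped-in technique, not the script above; it reads the file twice and takes a file name instead of standard input. The names is_utf8 and utf8_or_cp1251 are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, assuming only UTF-8 and CP1251 files occur.

# Succeeds if the file decodes cleanly as UTF-8 (plain ASCII included).
is_utf8() {
  iconv -f UTF-8 -t UTF-8 < "$1" >/dev/null 2>&1
}

# Prints the file as UTF-8, treating anything that is not valid UTF-8 as CP1251.
utf8_or_cp1251() {
  if is_utf8 "$1"; then
    cat < "$1"
  else
    iconv -f CP1251 -t UTF-8 < "$1"
  fi
}

# When run as a script, convert the file named by the first argument.
if [ "$#" -gt 0 ]; then
  utf8_or_cp1251 "$1"
fi
```

Checking the whole file costs an extra pass, but it does not depend on where a 1000-byte sample happens to end.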