How to view cp1251 text file in UTF-8 console

Tags: character-encoding, console, less, text

Attempt 1:

$ less subs.srt
"subs.srt" may be a binary file.  See it anyway? 
<C8><F2><E0><EB><FC><FF><ED> ...

Attempt 2:

$ LANG=ru_RU.CP1251 less subs.srt
����� �����, ��� ������.
��� ������� �������������! ...

Workaround:

$ iconv -f cp1251 < subs.srt | less

How do I do it conveniently?

Best Answer

To make less run in a different encoding from the terminal's, use luit (which ships with the X11 utility suite).

LANG=ru_RU.CP1251 luit less subs.srt

If you want to detect the encoding automatically, that's trickier, because a text file carries no indication of its encoding. The software Enca tries to recognize the encoding of a file based on its language:

$ enca -L russian subs.srt
MS-Windows code page 1251
$ iconv -f "$(enca -iL russian subs.srt)" subs.srt | less

You can turn this combination into a LESSOPEN filter (see How can I view gzipped files in less without having to type zless? for an example). However, that may not give good results for text that isn't actually in Russian.
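A minimal sketch of such a filter, assuming enca and iconv are installed and the text is Russian. The function name ru_to_utf8 is made up for illustration. It relies on enca returning a non-zero status when it cannot guess the encoding; printing nothing in that case lets less fall back to the original file.

```shell
#!/bin/sh
# Hypothetical helper: convert FILE ($1) to UTF-8 using enca's guess.
ru_to_utf8() {
  # -i prints the iconv name of the detected encoding; -L sets the language.
  # If enca cannot guess, print nothing so less shows the file as-is.
  enc=$(enca -iL russian "$1" 2>/dev/null) || return 0
  iconv -f "$enc" -t UTF-8 "$1" 2>/dev/null
}

# When run as a script with a file argument, act as the LESSOPEN filter.
[ "$#" -eq 0 ] || ru_to_utf8 "$1"
```

Saved as an executable script, say $HOME/bin/lessru (a hypothetical path), it could be enabled with export LESSOPEN="|$HOME/bin/lessru %s".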

If you only use UTF-8 and CP1251, you can fall back to CP1251 when a file isn't valid UTF-8: many byte sequences are illegal in UTF-8, so most files in an 8-bit encoding fail UTF-8 validation. Proof-of-concept filter script for LESSOPEN (might not work on systems other than Linux, because it relies on head -c N reading exactly N bytes):

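#!/bin/sh
# Read the first 1000 bytes; the trailing "." keeps $(...) from stripping newlines.
head=$(head -c 1000; echo .)
head=${head%.}
# In a UTF-8 locale "." matches no invalid byte, so a line failing '^.*$' is not UTF-8.
if printf '%s' "$head" | grep -qav '^.*$'; then
  # Not valid UTF-8: assume CP1251 and convert the whole input.
  { printf '%s' "$head"; cat; } | iconv -f CP1251
else
  # Valid so far: pass the input through unchanged.
  { printf '%s' "$head"; cat; }
fi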
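
The same fallback can also be sketched with iconv as the validator, assuming the filter is given a file name rather than standard input. Converting UTF-8 to UTF-8 fails on the first invalid sequence, so the exit status doubles as a validity check. The name to_utf8 is hypothetical, and unlike the script above this variant reads the whole file twice.

```shell
#!/bin/sh
# Hypothetical helper: print FILE ($1) as UTF-8, assuming it is either
# valid UTF-8 already or CP1251 (the two-encoding case described above).
to_utf8() {
  if iconv -f UTF-8 -t UTF-8 < "$1" >/dev/null 2>&1; then
    # Already valid UTF-8: pass it through unchanged.
    cat < "$1"
  else
    # Validation failed: assume CP1251 and convert.
    iconv -f CP1251 -t UTF-8 < "$1"
  fi
}
```

A call such as to_utf8 subs.srt | less would then work for files in either encoding.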

Related Solutions

Force less to display a file as text

I think you have (or your distribution has) a LESSOPEN filter set up for less. Try the following to tell less to not use the filter:

less -L my_binary_file

For further exploration, also try echo $LESSOPEN. It probably contains the name of a shell script (/usr/bin/lesspipe for me), which you can read through to see what sort of filters there are. Also try man less, and read the Input Preprocessor section.
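
To jump straight to that script, the command name can be pulled out of the variable. This is a rough sketch: the function name lessopen_script is hypothetical, and it assumes LESSOPEN has the usual form of optional leading | and - markers followed by a command and %s.

```shell
#!/bin/sh
# Hypothetical helper: print the command that $LESSOPEN would run.
lessopen_script() {
  # Drop the leading "|", "||" and "-" markers that less interprets itself.
  cmd=${LESSOPEN#"${LESSOPEN%%[!|-]*}"}
  # Split the rest on whitespace (globbing off) and keep the first word.
  set -f; set -- $cmd; set +f
  printf '%s\n' "$1"
}
```

Running less -L "$(lessopen_script)" would then open the filter script itself with the preprocessor disabled. If the value holds a bare command name rather than a path, it would need to be looked up in PATH first.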

Text – View File Containing DOS Text and Escape Sequences

That's the MS-DOS character set (code page 437).

Try recode cp437..u8 in a UTF-8 terminal.

It gives:

██▀▀▀▀▀▀ ██▀▀▀▀▀█  █▀▀▀▀▀█ ██▀▀█▀▀█ ██       █▀▀▀▀▀█ ██▀▀█ ██ ██▀▀▀▀▀▄
██▄▄▄▄▄▄ ██▄▄▄▄▄█  █▄▄▄▄▄█ ██ ██ ██ ██       █▄▄▄▄▄█ ██ ██ ██ ██    ██
      ▄█ ██        █    ▄█ ██    ██ ██       █    ▄█ ██ ██ ██ ██    ██
▄▄▄▄▄▄▄█ ██        █     █ ██    ██ ██▄▄▄▄▄  █     █ ██ ██▄▄█ ██▄▄▄▄▄▀

in colour.
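
If recode is not installed, iconv can do the same conversion. A minimal sketch, assuming the local iconv knows CP437; the function name dos_to_utf8 and the file name in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert CP437 (MS-DOS) text to UTF-8.
# Reads the named files, or standard input when none are given.
dos_to_utf8() {
  iconv -f CP437 -t UTF-8 "$@"
}
```

For example, dos_to_utf8 logo.ans | less -R pages the result while letting the colour escape sequences through to the terminal.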
