How to create a text file (1 gigabyte) containing random characters with UTF-8 character encoding

filesrandomtext;unicode

The following command does not use UTF-8: head -c 1M </dev/urandom >myfile.txt

Best Answer

If you want UTF-8 encodings of code points 0 to 0x7FFFFFFF (which the UTF-8 encoding algorithm was originally designed to work on):

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\4}
    no warnings "utf8";
    print chr(unpack("L>",$_) & 0x7fffffff)'

Nowadays, Unicode is restricted to 0..D7FF, E000..10FFFF (though some of those characters are not assigned, some of which will never be (are defined as non-characters)).

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\3}
    no warnings "utf8";
    $c = unpack("L>","\0$_") * 0x10f800 >> 24;
    $c += 0x800 if $c >= 0xd800;
    print chr($c)'

If you only want assigned characters, you can pipe that to:

uconv -x '[:unassigned:]>;'

Or change that to:

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\3}
    no warnings "utf8";
    $c = unpack("L>","\0$_") * 0x10f800 >> 24;
    $c += 0x800 if $c >= 0xd800;
    $c = chr $c;
    print $c if $c =~ /\P{unassigned}/'

You may prefer:

             if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/

To only get graphical and spacing ones (exclude those from the private-use sections).

Now, to get 1GiB of that, you can pipe it to head -c1G (assuming GNU head), but beware the last character may be cut in the middle.

Related Solutions

How to set VIM’s default encoding to UTF-8

When Vim reads an existing file, it tries to detect the file encoding. When writing out the file, Vim uses the file encoding that it detected (except when you tell it differently). So a file detected as UTF-8 is written as UTF-8, a file detected as Latin-1 is written as Latin-1, and so on.

By default, the detection process is crude. Every file that you open with Vim will be assumed to be Latin-1, unless it detects a Unicode byte-order mark at the top. A UTF-8 file without a byte-order mark will be hard to edit because any multibyte characters will be shown in the buffer as character sequences instead of single characters.

Worse, Vim, by default, uses Latin-1 to represent the text in the buffer. So a UTF-8 file with a byte-order mark will be corrupted by down-conversion to Latin-1.

The solution is to configure Vim to use UTF-8 internally. This is, in fact, recommended in the Vim documentation, and the only reason it is not configured that way out of the box is to avoid creating enormous confusion among users who expect Vim to operate basically as a Latin-1 editor.

In your .vimrc, add set encoding=utf-8 and restart Vim.

Or instead, set the LANG environment variable to indicate that UTF-8 is your preferred character encoding. This will affect not just Vim but any software which relies on LANG to determine how it should represent text. For example, to indicate that text should appear in English (en), as spoken in the United States (US), encoded as UTF-8 (utf-8), set LANG=en_US.utf-8.

Now Vim will use UTF-8 to represent the text in the buffer. Plus, it will also make a more determined effort to detect the UTF-8 encoding in a file. Besides looking for a byte-order mark, it will also check for UTF-8 without a byte-order mark before falling back to Latin-1. So it will no longer corrupt a file coded in UTF-8, and it should properly display the UTF-8 characters during the editing session.

For more information on how Vim detects the file encoding, see the fileencodings option in the Vim documentation.

For more information on setting the encoding that Vim uses internally, see the encoding option.

If you need to override the encoding used when writing a file back to disk, see the fileencoding option.

Character Encoding – Convert Non-ISO Extended-ASCII to UTF-8

file tells you “Non-ISO extended-ASCII text” because it detects that this is:

most likely a “text” file from the lack of control characters (byte values 0–31) other than line breaks;
“extended-ASCII” because there are characters outside the ASCII range (byte values ≥128);
“non-ISO” because there are characters in the 128–159 range (ISO 8859 reserves this range for control characters).

You have to figure out which encoding this file seems to be in. You can try Enca's automatic recognition. You might need to nudge it in the right direction by telling it in what language the text is.

enca x.txt
enca -L polish x.txt

To convert the file, pass the -x option: enca -L polish x.txt -x utf8 >x.utf8.txt

If you can't or don't want to use Enca, you can guess the encoding manually. A bit of looking around told me that this is Polish text and the words are trwały, stały, usuważ, so we're looking for a translation where ³ → ł and æ → ż. This looks like latin-2 or latin-10 or more likely (given “non-ISO” CP1250 which you're viewing as latin1. To convert the file to UTF-8, you can use recode or iconv.

recode CP1250..utf8 <x.txt >x.utf8.txt
iconv -f CP1250 -t UTF-8 <x.txt >x.utf8.txt

Best Answer

Related Solutions

How to set VIM’s default encoding to UTF-8

Character Encoding – Convert Non-ISO Extended-ASCII to UTF-8

Related Question