How to create a text file (1 gigabyte) containing random characters with UTF-8 character encoding

Tags: files, random, text, unicode

The following command does not produce UTF-8; it writes raw random bytes, which are almost never a valid UTF-8 sequence: head -c 1M </dev/urandom >myfile.txt
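
A quick check (our own illustration, not part of the original question): iconv exits non-zero at the first invalid byte sequence, which for raw random data happens almost immediately:

```shell
# Write 1 MiB of raw random bytes, then try to interpret them as UTF-8.
head -c 1M </dev/urandom >myfile.txt
if iconv -f UTF-8 -t UTF-8 myfile.txt >/dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"   # almost certainly this branch
fi
```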

Best Answer

If you want UTF-8 encodings of code points 0 to 0x7FFFFFFF (which the UTF-8 encoding algorithm was originally designed to work on):

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\4}        # read fixed 4-byte records
    no warnings "utf8";
    # big-endian 32-bit value with the top bit cleared: 0..0x7FFFFFFF
    print chr(unpack("L>",$_) & 0x7fffffff)'

Nowadays, Unicode code points are restricted to the ranges 0..D7FF and E000..10FFFF. Some of those code points are unassigned, and some of them never will be assigned (they are defined as noncharacters).

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\3}        # read fixed 3-byte records
    no warnings "utf8";
    # scale the 24-bit value onto the 0x10F800 valid code points
    $c = unpack("L>","\0$_") * 0x10f800 >> 24;
    $c += 0x800 if $c >= 0xd800;   # skip the surrogates D800..DFFF
    print chr($c)'

If you only want assigned characters, you can pipe that to:

uconv -x '[:unassigned:]>;'

Or change that to:

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\3}
    no warnings "utf8";
    $c = unpack("L>","\0$_") * 0x10f800 >> 24;
    $c += 0x800 if $c >= 0xd800;
    $c = chr $c;
    # keep only assigned code points
    print $c if $c =~ /\P{unassigned}/'

You may prefer:

             if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/

to get only graphic and spacing characters, excluding those in the private-use areas.

Now, to get 1 GiB of that, you can pipe it to head -c1G (assuming GNU head), but beware that the last character may be cut in the middle of its multi-byte encoding, leaving the file ending with an incomplete UTF-8 sequence.
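
If a truncated final character matters, one option (our own sketch, not from the answer; the byte-counting logic is ours) is to track the output size inside perl and stop on a character boundary once the target is reached, instead of truncating with head:

```shell
# Stop once at least $max bytes have been written, always finishing the
# current character, so the output is valid UTF-8 end to end.
< /dev/urandom perl -CO -ne '
    BEGIN{$/ = \3; $max = 2**30}   # $max = 1 GiB
    no warnings "utf8";
    $c = unpack("L>", "\0$_") * 0x10f800 >> 24;
    $c += 0x800 if $c >= 0xd800;
    my $s = chr $c;
    my $b = $s; utf8::encode($b);  # byte length of the UTF-8 encoding
    $len += length $b;
    print $s;
    exit if $len >= $max' > random.txt
```

The result is at least 1 GiB and at most three bytes over, and always ends on a complete character.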