How can I truncate a (UTF-8 encoded) text file to a given number of characters? I don't care about line lengths and the cut can be in the middle of a word.

`cut` seems to operate on lines, but I want the whole file. `head -c` uses bytes, not characters.
Best Answer
Some systems have a `truncate` command that truncates files to a number of bytes (not characters). I don't know of any that truncates to a number of characters, though you could resort to `perl`, which is installed by default on most systems:

perl
With `-Mopen=locale`, we use the locale's notion of what characters are (so in locales using the UTF-8 charset, that's UTF-8 encoded characters). Replace it with `-CS` if you want I/O to be decoded/encoded in UTF-8 regardless of the locale's charset.

`$/ = \1234`: we set the record separator to a reference to an integer, which is a way to specify records of fixed length (in a number of characters).

Then, upon reading the first record, we truncate stdin in place (so at the end of the first record) and exit.
GNU sed
With GNU `sed`, you could do the same (assuming the file doesn't contain NUL characters or sequences of bytes which don't form valid characters, both of which should be true of text files). But that's far less efficient, as it reads the file in full, stores it whole in memory, and writes a new copy.
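A sketch of that sed command, again demoed on a throwaway file with 4 instead of 1234 (needs GNU sed for `-z` and `-i`):

```shell
file=/tmp/sed-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# -z: NUL-delimited records, so a NUL-free file is one single record
# -E: extended regexps; -i: rewrite the file in place
sed -Ez -i -- 's/^(.{4}).*/\1/' "$file"
```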
GNU awk
Same with GNU `awk`. `-e code -E /dev/null "$file"` is one way to pass arbitrary file names to `gawk`, and `RS='^$'` selects slurp mode (a regexp that never matches any input, so the whole file is read as one record).
Shell builtins

With `ksh93`, `bash` or `zsh` (with shells other than `zsh`, assuming the content doesn't contain NUL bytes):
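For instance, a sketch (the appended dot protects trailing newlines from command-substitution stripping, and `${content:0:n}` counts characters in these shells; demo values again):

```shell
file=/tmp/slurp-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# slurp the file; the trailing dot keeps $(...) from eating final newlines
content=$(cat < "$file" && echo .) &&
  content=${content%.} &&
  printf %s "${content:0:4}" > "$file"
```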
With `zsh`:
With `ksh93` or `bash` (beware it's bogus for multi-byte characters in several versions of `bash`):
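A sketch (demo values again; `read -N` reads a fixed number of characters rather than a line):

```shell
file=/tmp/readn-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# -N4: read exactly 4 characters; IFS= and -r preserve whitespace
# and backslashes.  read finishes before > truncates the file.
IFS= read -rN4 content < "$file" &&
  printf %s "$content" > "$file"
```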
`ksh93` can also truncate the file in place, instead of rewriting it, with its `<>;` redirection operator:
iconv + head

To print the first 1234 characters, another option could be to convert to an encoding with a fixed number of bytes per character, like UTF32BE/UCS-4:
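For example, a sketch (spelled `UTF-32BE` here, a spelling GNU iconv accepts; the demo file, the output path, and the count of 4 are placeholders):

```shell
file=/tmp/iconv-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# UTF-32BE uses exactly 4 bytes per character, so the first 4
# characters are the first 4 * 4 bytes of the converted stream
iconv -t UTF-32BE < "$file" | head -c "$((4 * 4))" | iconv -f UTF-32BE > "$file.first4"
```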
`head -c` is not standard, but fairly common. A standard equivalent would be `dd bs=1 count="$((1234 * 4))"`, but it would be less efficient, as it would read the input and write the output one byte at a time¹. `iconv` is a standard command, but the encoding names are not standardized, so you might find systems without `UCS-4`.
Notes
In any case, though the output would have at most 1234 characters, it may end up not being valid text, as it would possibly end in a non-delimited line.
Also note that while those solutions wouldn't cut text in the middle of a character, they could break it in the middle of a grapheme, like an é expressed as U+0065 U+0301 (an e followed by a combining acute accent), or Hangul syllable graphemes in their decomposed forms.

¹ And on pipe input, you can't reliably use `bs` values other than 1 unless you use the `iflag=fullblock` GNU extension, as `dd` could do short reads if it reads the pipe quicker than `iconv` fills it.