How can I truncate a (UTF-8 encoded) text file to a given number of characters? I don't care about line lengths and the cut can be in the middle of a word.

`cut` seems to operate on lines, but I want the whole file. `head -c` uses bytes, not characters.
Best Answer
Some systems have a `truncate` command that truncates files to a number of bytes (not characters). I don't know of any that truncates to a number of characters, though you could resort to `perl`, which is installed by default on most systems:

perl
With `-Mopen=locale`, we use the locale's notion of what characters are (so in locales using the UTF-8 charset, that's UTF-8 encoded characters). Replace it with `-CS` if you want I/O to be decoded/encoded in UTF-8 regardless of the locale's charset.

`$/ = \1234`: we set the record separator to a reference to an integer, which is a way to specify records of fixed length (in a number of characters).

Then, upon reading the first record, we truncate stdin in place (so at the end of the first record) and exit.
GNU sed
With GNU `sed`, you could do the same (assuming the file doesn't contain NUL characters or sequences of bytes which don't form valid characters, both of which should be true of text files). But that's far less efficient, as it reads the file in full, stores it whole in memory, and writes a new copy.
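A sketch of that sed command, again demoed on a throwaway file with 4 instead of 1234 (needs GNU sed for `-z` and `-i`):

```shell
file=/tmp/sed-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# -z: NUL-delimited records, so a NUL-free file is one single record
# -E: extended regexps; -i: rewrite the file in place
sed -Ez -i -- 's/^(.{4}).*/\1/' "$file"
```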
GNU awk
Same with GNU `awk`. `-e code -E /dev/null "$file"` is one way to pass arbitrary file names to `gawk`, and `RS='^$'` selects slurp mode (a regexp that never matches any input, so the whole file is read as one record).
Shell builtins

With `ksh93`, `bash` or `zsh` (with shells other than `zsh`, assuming the content doesn't contain NUL bytes):
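For instance, a sketch (the appended dot protects trailing newlines from command-substitution stripping, and `${content:0:n}` counts characters in these shells; demo values again):

```shell
file=/tmp/slurp-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# slurp the file; the trailing dot keeps $(...) from eating final newlines
content=$(cat < "$file" && echo .) &&
  content=${content%.} &&
  printf %s "${content:0:4}" > "$file"
```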
With `zsh`:
With `ksh93` or `bash` (beware it's bogus for multi-byte characters in several versions of `bash`):
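A sketch (demo values again; `read -N` reads a fixed number of characters rather than a line):

```shell
file=/tmp/readn-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# -N4: read exactly 4 characters; IFS= and -r preserve whitespace
# and backslashes.  read finishes before > truncates the file.
IFS= read -rN4 content < "$file" &&
  printf %s "$content" > "$file"
```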
`ksh93` can also truncate the file in place, instead of rewriting it, with its `<>;` redirection operator:
iconv + head

To print the first 1234 characters, another option could be to convert to an encoding with a fixed number of bytes per character, like UTF32BE/UCS-4:
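For example, a sketch (spelled `UTF-32BE` here, a spelling GNU iconv accepts; the demo file, the output path, and the count of 4 are placeholders):

```shell
file=/tmp/iconv-trunc-demo.txt
printf '%s' 'abcdefghij' > "$file"

# UTF-32BE uses exactly 4 bytes per character, so the first 4
# characters are the first 4 * 4 bytes of the converted stream
iconv -t UTF-32BE < "$file" | head -c "$((4 * 4))" | iconv -f UTF-32BE > "$file.first4"
```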
`head -c` is not standard, but fairly common. A standard equivalent would be `dd bs=1 count="$((1234 * 4))"`, but it would be less efficient, as it would read the input and write the output one byte at a time¹. `iconv` is a standard command, but the encoding names are not standardized, so you might find systems without `UCS-4`.
Notes
In any case, though the output would have at most 1234 characters, it may end up not being valid text, as it would possibly end in a non-delimited line.
Also note that while those solutions wouldn't cut text in the middle of a character, they could break it in the middle of a grapheme, like an é expressed as U+0065 U+0301 (an e followed by a combining acute accent), or Hangul syllable graphemes in their decomposed forms.

¹ And on pipe input, you can't reliably use `bs` values other than 1 unless you use the `iflag=fullblock` GNU extension, as `dd` could do short reads if it reads the pipe quicker than `iconv` fills it.