With C, omitting meaningful error messages:
#define _POSIX_C_SOURCE 200809L  /* for getline() */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    FILE *L;                  /* list of line numbers to print, in ascending order */
    FILE *F;                  /* file to print lines from */
    unsigned int to_print;
    unsigned int current = 0;
    char *line = NULL;
    size_t len = 0;

    if ((L = fopen(argv[1], "r")) == NULL) {
        return 1;
    } else if ((F = fopen(argv[2], "r")) == NULL) {
        fclose(L);
        return 1;
    } else {
        while (fscanf(L, "%u", &to_print) > 0) {
            /* skip lines of F until the requested line number */
            while (getline(&line, &len, F) != -1 && ++current != to_print)
                ;
            if (current == to_print) {
                printf("%s", line);
            }
        }
        free(line);
        fclose(L);
        fclose(F);
        return 0;
    }
}
Some systems have a truncate command that truncates files to a given number of bytes (not characters).
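For example, with GNU coreutils' truncate (file name hypothetical):

```shell
printf 'hello world' > file.txt   # 11 bytes
truncate -s 5 file.txt            # keep only the first 5 bytes
cat file.txt                      # -> hello
```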
I don't know of any that truncate to a number of characters, though you could resort to perl, which is installed by default on most systems:
perl
perl -Mopen=locale -ne '
BEGIN{$/ = \1234} truncate STDIN, tell STDIN; last' <> "$file"
With -Mopen=locale, we use the locale's notion of what characters are (so in locales using the UTF-8 charset, that's UTF-8 encoded characters). Replace it with -CS if you want I/O to be decoded/encoded in UTF-8 regardless of the locale's charset.
$/ = \1234: we set the record separator to a reference to an integer, which is a way to specify records of fixed length (in number of characters).
Then, upon reading the first record, we truncate stdin in place (that is, at the end of the first record) and exit.
GNU sed
With GNU sed, you could do (assuming the file doesn't contain NUL characters or sequences of bytes which don't form valid characters, both of which should be true of text files):
sed -Ez -i -- 's/^(.{1234}).*/\1/' "$file"
But that's far less efficient, as it reads the file in full, stores it whole in memory, and writes a new copy.
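For instance, keeping the first 5 characters of a small sample file (name hypothetical):

```shell
printf 'hello world' > sample2.txt
sed -Ez -i -- 's/^(.{5}).*/\1/' sample2.txt
cat sample2.txt   # -> hello
```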
GNU awk
Same with GNU awk:
awk -i inplace -v RS='^$' -e '{printf "%s", substr($0, 1, 1234)}' -E /dev/null "$file"
-e code -E /dev/null "$file" is one way to pass arbitrary file names to gawk (-E ends option processing, so a file name starting with - is not taken as an option).
RS='^$': slurp mode (a record separator regexp that can never match in a non-empty file, so the whole input is read as a single record).
Shell builtins
With ksh93, bash or zsh (with shells other than zsh, assuming the content doesn't contain NUL bytes):
content=$(cat < "$file" && echo .) &&
content=${content%.} &&
printf %s "${content:0:1234}" > "$file"
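For instance, in bash, keeping the first 7 characters of a two-line sample file (name hypothetical); the echo . / ${content%.} dance preserves the trailing newlines that command substitution would otherwise strip:

```shell
printf 'hello\nworld\n' > sample3.txt
content=$(cat < sample3.txt && echo .) &&
content=${content%.} &&
printf %s "${content:0:7}" > sample3.txt
wc -c < sample3.txt   # 7 bytes: "hello", its newline, and "w"
```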
With zsh:
read -k1234 -u0 s < $file &&
printf %s $s > $file
Or:
zmodload zsh/mapfile
mapfile[$file]=${mapfile[$file][1,1234]}
With ksh93 or bash (beware it's bogus for multi-byte characters in several versions of bash):
IFS= read -rN1234 s < "$file" &&
printf %s "$s" > "$file"
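For instance, in bash (file name hypothetical), keeping the first 5 characters:

```shell
printf 'hello world' > sample4.txt
IFS= read -rN5 s < sample4.txt &&
printf %s "$s" > sample4.txt
cat sample4.txt   # -> hello
```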
ksh93 can also truncate the file in place, instead of rewriting it, with its <>; redirection operator:
IFS= read -rN1234 0<>; "$file"
iconv + head
To print the first 1234 characters, another option could be to convert to an encoding with a fixed number of bytes per character, like UTF32BE/UCS-4:
iconv -t UCS-4 < "$file" | head -c "$((1234 * 4))" | iconv -f UCS-4
head -c is not standard, but fairly common. A standard equivalent would be dd bs=1 count="$((1234 * 4))", but that would be less efficient, as it would read the input and write the output one byte at a time¹. iconv is a standard command, but the encoding names are not standardized, so you might find systems without UCS-4.
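A sketch printing the first 5 characters; an explicit -f UTF-8 is added here so the demo doesn't depend on the locale's charset (and on systems lacking UCS-4, UTF-32BE may be available instead):

```shell
printf 'hello world' |
  iconv -f UTF-8 -t UCS-4 |     # 4 bytes per character
  head -c "$((5 * 4))" |        # first 5 characters = 20 bytes
  iconv -f UCS-4 -t UTF-8
# -> hello
```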
Notes
In any case, though the output would have at most 1234 characters, it may end up not being valid text, as it would possibly end in a non-delimited line.
Also note that while those solutions wouldn't cut text in the middle of a character, they could break it in the middle of a grapheme, like an é expressed as U+0065 U+0301 (an e followed by a combining acute accent), or Hangul syllable graphemes in their decomposed forms.
¹ And on pipe input, you can't reliably use bs values other than 1 unless you use the iflag=fullblock GNU extension, as dd could do short reads if it reads the pipe quicker than iconv fills it.
Best Answer
will count the bytes in the tenth line of myfile (including the linefeed/newline character). A slightly less readable variant, (or sed '10!d;q' or sed '10q;d'), will stop reading the file after the tenth line, which would be interesting on longer files (or streams). (Thanks to Tim Kennedy and Peter Cordes for the discussion leading to this.) There are performance comparisons of different ways of extracting lines of text in cat line X to line Y on a huge file.
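For instance, combining the sed '10q;d' variant above with wc -c to count the bytes of the tenth line (sample file hypothetical):

```shell
seq 12 | sed 's/^/line/' > myfile   # line1 ... line12
sed '10q;d' myfile | wc -c          # "line10" plus its newline: 7 bytes
```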