POSIX Text Files – Conditions for a File to be a Text File as Defined by POSIX


POSIX defines a text file as:

A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character. Although POSIX.1-2017 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.

Source: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403

However, there are several things I find unclear:

  1. Must a text file be a regular file? The excerpt above does not explicitly say that the file must be a regular file.

  2. Can a file be considered a text file if it contains one character and one character only (i.e., a single character that isn't terminated with a newline)? I know this question may sound nitpicky, but they use the word "characters" instead of "one or more characters". Others may disagree, but if they mean "one or more characters" I think they should say so explicitly.

  3. The excerpt above refers to "lines". I found four definitions with "line" in their name: "Empty Line", "Display Line", "Incomplete Line" and "Line". Am I supposed to infer that they mean "Line" because they omit "Empty", "Display" and "Incomplete", or do all four definitions count as a line in the excerpt above?

All questions that come after this block of text depend on inferring that "characters" means "one or more characters":

  1. Can I safely infer that if a file is empty, it is not a text file because it does not contain one or more characters?

All questions that come after this block of text depend on inferring that in the above excerpt, a line is defined as a "Line", and that the other three definitions containing "Line" in their name should be excluded:

  1. Does the "zero" in "zero or more lines" mean that a file can still be considered a text file if it contains one or more characters that are not terminated with newline?

  2. Does "zero or more lines" mean that once a single "Line" (0 or more characters plus a terminating newline) comes into play, that it becomes illegal for the last line to be an "Incomplete Line" (one or more non-newline characters at the end of a file)?

  3. Does "none [no line] can exceed {LINE_MAX} bytes in length, including the newline character" mean that there is a limit on the number of characters allowed in any given "Line" in a text file? (As an aside, the value of LINE_MAX on Ubuntu 18.04 and FreeBSD 11.1 is "2048".)

Best Answer

  1. Must a text file be a regular file? The excerpt above does not explicitly say that the file must be a regular file.

    No. The excerpt itself refers to the STDIN sections of utilities, so standard input (which may be a pipe or a terminal rather than a regular file) can be a text file. Some standard utilities, such as make, specifically use the character special file /dev/null as a text file.

  2. Can a file be considered a text file if it contains one character and one character only (i.e., a single character that isn't terminated with a newline)?

    That character must be a <newline>, or this isn't a line, and so the file it's in isn't a text file. A file containing exactly byte 0A is a single-line text file. An empty line is a valid line.

  3. The excerpt above refers to "lines". I found four definitions with "line" in their name: "Empty Line", "Display Line", "Incomplete Line" and "Line". Am I supposed to infer that they mean "Line" because they omit "Empty", "Display" and "Incomplete"?

    It's not really an inference, it's just what it says. The word "line" has been given a contextually-appropriate definition and so that's what it's talking about.

  4. Can I safely infer that if a file is empty, it is not a text file because it does not contain one or more characters?

    An empty file consists of zero (or more) lines and is thus a text file.

  5. Does the "zero" in "zero or more lines" mean that a file can still be considered a text file if it contains one or more characters that are not terminated with newline?

    No, these characters are not organised into lines.

  6. Does "zero or more lines" mean that once a single "Line" (0 or more characters plus a terminating newline) comes into play, that it becomes illegal for the last line to be an "Incomplete Line" (one or more non-newline characters at the end of a file)?

    It's not illegal, it's just not a text file. A utility requiring a text file to be given to it may behave adversely if given that file instead.

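  7. Does "none [no line] can exceed {LINE_MAX} bytes in length, including the newline character" mean that there is a limit on the number of characters allowed in any given "Line" in a text file?

    Yes, although the limit is counted in bytes rather than characters, and it includes the terminating <newline>.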
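The answers above boil down to three checks, which can be sketched in Python. This is only an illustration, not anything POSIX provides: the function name `is_posix_text` is made up, the default of 2048 is simply the LINE_MAX value mentioned in the question, and the sketch ignores whether the bytes form valid characters in the current locale.

```python
def is_posix_text(data: bytes, line_max: int = 2048) -> bool:
    """Return True if data meets the three conditions discussed above.

    Hypothetical helper; line_max defaults to the value quoted in the question.
    """
    if b"\x00" in data:
        return False  # lines may not contain NUL characters
    if not data:
        return True   # zero lines is still "zero or more lines"
    if not data.endswith(b"\n"):
        return False  # trailing bytes form an incomplete line, not a line
    # Drop the final newline, split on the rest, and count each line
    # together with its terminating newline (hence the + 1).
    return all(len(line) + 1 <= line_max for line in data[:-1].split(b"\n"))
```

With this sketch, `is_posix_text(b"")` and `is_posix_text(b"\n")` are both true, while `is_posix_text(b"a")` and `is_posix_text(b"a\nb")` are false, matching answers 2, 4, 5 and 6.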

This definition is just trying to set some bounds on what a text-based utility (for example, grep) will definitely accept, nothing more. Utilities are also free to accept input more liberally, and in practice they quite often do. An implementation is permitted to use a fixed-size buffer to process a line, to assume a newline appears before the buffer is full, and so on. You may be reading too much into things.
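To make the fixed-size buffer point concrete, here is a rough Python sketch of a reader that never reads more than a fixed number of bytes per line and gives up on anything else. The names `line_max` and `read_lines_bounded` are invented for this example. `os.sysconf("SC_LINE_MAX")` is a real call on Unix-like systems, but the fallback of 2048 is an assumption taken from the value quoted in the question.

```python
import io
import os


def line_max(default: int = 2048) -> int:
    """Return the system's LINE_MAX, or default if it cannot be queried."""
    try:
        value = os.sysconf("SC_LINE_MAX")
    except (AttributeError, ValueError, OSError):
        return default  # no sysconf here, or the name is not supported
    return value if value > 0 else default


def read_lines_bounded(stream, limit: int):
    """Yield lines from a binary stream, reading at most limit bytes each."""
    while True:
        chunk = stream.readline(limit)  # stops at a newline or after limit bytes
        if not chunk:
            return  # end of input
        if not chunk.endswith(b"\n"):
            # Either the line is longer than the buffer or the input ends
            # with an incomplete line; both fall outside the definition.
            raise ValueError("not a text file: line too long or unterminated")
        yield chunk


# Example: a well-formed input passes through unchanged.
lines = list(read_lines_bounded(io.BytesIO(b"a\nbb\n"), line_max()))
```

A real utility is free to be more forgiving than this; the sketch only shows the minimum that the definition lets an implementation get away with.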