Gnu Sed – Does the p Command Append a Newline When Printing

sed

root@u1804:~# sed --version
sed (GNU sed) 4.5
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jay Fenlason, Tom Lord, Ken Pizzini,
and Paolo Bonzini.
GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
root@u1804:~#

I'm new to sed and I created below sed's workflow based on my understanding (correct me if you find anything wrong).

So it seems the default auto printing of the pattern space will always include a newline at the end. My question is, will p includes a newline, too? I have below examples.

root@u1804:~# seq 3 | sed -rn 'p'
1
2
3
root@u1804:

Here the newline at the end of each number is added by sed itself (see the diagram "adds back newline to pattern space"). So it seems p will not append a newline. However, see below example.

root@u1804:~# seq 3 | sed -rn 'x;p;x;p'

1

2

3
root@u1804:~#

Here x exchange pattern space with hold space, which will result in an empty pattern space. Now p applies to the pattern space (nothing in it) should print nothing. But based on the result, it seems here p prints a newline. To me it seems this is inconsistent behavior. Can anyone explain?

Best Answer

To answer your main question:

GNU sed will append a <newline> character when executing the p command unless the input line was missing its terminating <newline> character (see the clarifications about lines below).

As far as I can tell, sed's p flag and its auto-print feature implement the same logic to output the pattern space: if the trailing <newline> character was removed, they add it back; otherwise they don't.

Examples:

$ printf '%s\n%s' '4' '5' | sed ';' | hexdump -C      # auto-print
00000000  34 0a 35                                          |4.5|
00000003

$ printf '%s\n%s' '4' '5' | sed -n 'p;' | hexdump -C  # no auto-print; p flag
00000000  34 0a 35                                          |4.5|
00000003

In both cases there is no <newline> character (0a) in the output for the input lines that don't have one.

About your diagrams:

"Adds back newline to pattern space" is probably inaccurate because the <newline> character is not put in the pattern space¹. Also, that step is not related to the -n option - but this does not make the diagram wrong; rather, it should probably be merged into "Print pattern space".
Still, I agree with you about the documentation's lack of clarity.

¹_{The sentence you quote in your own answer, "the contents of pattern space are printed out to the output stream, adding back the trailing newline if it was removed", means that the <newline> is appended to the stream, not to pattern space. Of course, since pattern space is cleared in a short while, this is a really minor point}

About your tests involving the x flag:

Internally, pattern space and hold space are structures, and "was my trailing <newline> character dropped?" is a member of them. We will call it chomped (as it is named in sed's source code, by the way).
Pattern space is filled with a read line and its chomped attribute depends on how that line is terminated: true if it ends with a <newline> character, false otherwise. On the other hand, hold space is initialized as empty and its chomped attributed is just set to true.
Therefore, when you swap pattern space and hold space and print what was born as hold and is now pattern, a <newline> character is printed.

Examples - these commands have the same output:

$ printf '\n' | sed -n 'p;' | hexdump -C        # input is only a <newline>
00000000  0a                                                |.|
00000001

$ printf '%s' '5' | sed -n 'x;p;' | hexdump -C  # input has no <newline>
00000000  0a                                                |.|
00000001

(I gave only a really brief look at sed's code, so this might well be not accurate).

About lines (clarification started with comments to your answer):

It goes without saying that a line without a terminating <newline> character is a problematic concept. Quoting POSIX:

3.206 Line
A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

Furthermore, POSIX defines a text file:

3.403 Text File
A file that contains characters organized into zero or more lines. ...

Finally, POSIX on sed (bold mine):

DESCRIPTION
The sed utility is a stream editor that shall read one or more text files, make editing changes according to a script of editing commands, and write the results to standard output. ...

GNU sed, though, seems to be less strict when defining its input:

sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). ...

So, relating to my first sentence, we should take into account that, for GNU sed, what is read into the pattern space doesn't necessarily have to be a well formed line of text.

Related Solutions

Can sed remove ‘double’ newline characters

Just to remove empty lines:

sed  '/^$/d'

sed is line oriented, so thinking in terms of "2 or more of a particular byte" works except when that byte is a newline. Then you have to think of something that works for the whole line.

What does POSIX sed require for `1d;1,2d` where an address range starts from an already-deleted line

That was raised on the Austin group mailing list in March 2012. Here's the final message on that (by Geoff Clare of the Austin Group (the body that maintains POSIX), who is also the one who raised the issue in the first place). Here copied from the gmane NNTP interface:

Date: Fri, 16 Mar 2012 17:09:42 +0000
From: Geoff Clare <gwc-7882/jkIBncuagvECLh61g@public.gmane.org>
To: austin-group-l-7882/jkIBncuagvECLh61g@public.gmane.org
Newsgroups: gmane.comp.standards.posix.austin.general
Subject: Re: Strange addressing issue in sed

Stephane Chazelas <stephane_chazelas-Qt13gs6zZMY@public.gmane.org> wrote, on 16 Mar 2012:
>
> 2012-03-16 15:44:35 +0000, Geoff Clare:
> > I've been alerted to an odd behaviour of sed on certified UNIX
> > systems that doesn't seem to match the requirements of the
> > standard.  It concerns an interaction between the 'n' command
> > and address matching.
> > 
> > According to the standard, this command:
> > 
> > printf 'A\nB\nC\nD\n' | sed '1,3s/A/B/;1,3n;1,3s/B/C/'
> > 
> > should produce the output:
> > 
> > B
> > C
> > C
> > D
> > 
> > GNU sed does produce this, but certified UNIX systems produce this:
> > 
> > B
> > B
> > C
> > D
> > 
> > However, if I change the 1,3s/B/C/ to 2,3s/B/C/ then they produce
> > the expected output (tested on Solaris and HP-UX).
> > 
> > Is this just an obscure bug from common ancestor code, or is there
> > some legitimate reason why this address change alters the behaviour?
> [...]
> 
> I suppose the idea is that for the second 1,3cmd, line "1" has
> not been seen, so the 1,3 range is not entered.

Ah yes, now it makes sense, and it looks like the standard does
require this slightly strange behaviour, given how the processing
of the "two addresses" case is specified:

    An editing command with two addresses shall select the inclusive
    range from the first pattern space that matches the first address
    through the next pattern space that matches the second.  (If the
    second address is a number less than or equal to the line number
    first selected, only one line shall be selected.) Starting at the
    first line following the selected range, sed shall look again for
    the first address. Thereafter, the process shall be repeated.

It's specified this way because the addresses can be BREs, but if
the same matching process is applied to the line numbers (even though
they can only match at most once), then the 1,3 range on that last
command is never entered.

-- 
Geoff Clare <g.clare-7882/jkIBncuagvECLh61g@public.gmane.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England

And here's the relevant part of the rest of the message (by me) that Geoff was quoting:

I suppose the idea is that for the second 1,3cmd, line "1" has
not been seen, so the 1,3 range is not entered.

Same idea as in

printf '%s\n' A B C | sed -n '1d;1,2p'

whose behavior differ in traditional (heirloom toolchest at
least) and GNU.

It's unclear to me whether POSIX wants one behavior or the
other.

So, (according to Geoff) POSIX is clear that the GNU behaviour is non-compliant.

And it's true it's less consistent (compare seq 10 | sed -n '1d;1,2p' with seq 10 | sed -n '1d;/^1$/,2p') even if potentially less surprising to people who don't realise how ranges are processed (even Geoff initially found the conforming behaviour "strange").

Nobody bothered reporting it as a bug to the GNU folks. I'm not sure I'd qualify it as a bug. Probably the best option would be for the POSIX specification to be updated to allow both behaviours to make it clear that one cannot rely on either.

Edit. I've now had a look at the original sed implementation in Unix V7 from the late 70s, and it looks pretty much like that behaviour for numeric addresses was not intended or at least not thought through completely there.

With Geoff's reading of the spec (and my original interpretation of why it happens), conversely, in:

seq 5 | sed -n '3d;1,3p'

lines 1, 2, 4 and 5 should be output, because this time, it's the end address that is never encountered by the 1,3p ranged command, like in seq 5 | sed -n '3d;/1/,/3/p'

Yet, that doesn't happen in the original implementation, nor any other implementation I tried (busybox sed returns lines 1, 2 and 4 which looks more like a bug).

If you look at the UNIX v7 code, it does check for the case where the current line number is greater than the (numerical) end address, and gets out of the range then. The fact that it doesn't do it for the start address looks more like an oversight then than an intentional design.

What that means is that there's no implementation that is actually compliant to that interpretation of the POSIX spec in that regard at the moment.

Another confusing behaviour with the GNU implementation is:

$ seq 5 | sed -n '2d;2,/3/p'
3
4
5

Since line 2 was skipped, the 2,/3/ is entered upon line 3 (the first line whose number is >= 2). But as it's the line that made us enter the range, it's not checked for the end address. It gets worse with busybox sed in:

$ seq 10 | busybox sed -n '2,7d; 2,3p'
8

Since lines 2 to 7 were deleted, line 8 is the first one that is >= 2 so the 2,3 range is entered then!

Best Answer

Related Solutions

Can sed remove ‘double’ newline characters

What does POSIX sed require for `1d;1,2d` where an address range starts from an already-deleted line

Related Question