What does POSIX sed require for `1d;1,2d` where an address range starts from an already-deleted line

posix sed

In the comments to this question a case came up where various sed implementations disagreed on a fairly simple program, and we (or at least I) weren't able to determine what the specification actually requires for it.

The issue is the behaviour of a range beginning at a deleted line:

1d;1,2d

Should line 2 be deleted even though the start of the range was removed before reaching that command? My initial expectation was "no" in line with BSD sed, while GNU sed says "yes", and checking the specification text doesn't entirely resolve the matter.

Matching my expectation are (at least) macOS and Solaris sed, and BSD sed. Disagreeing are (at least) GNU and Busybox sed, and numerous people here. The first two are SUS-certified while the others are likely more widespread. Which behaviour is correct?

The relevant specification text says:

The sed utility shall then apply in sequence all commands whose addresses select that pattern space, until a command starts the next cycle or quits.

and

An editing command with two addresses shall select the inclusive range from the first pattern space that matches the first address through the next pattern space that matches the second. […] Starting at the first line following the selected range, sed shall look again for the first address. Thereafter, the process shall be repeated.

Arguably, line 2 is within "the inclusive range from the first pattern space that matches the first address through the next pattern space that matches the second", regardless of whether the start point has been deleted. On the other hand, I expected the first d to move on to the next cycle and not give the range a chance to start. The UNIX™-certified implementations do what I expected, but potentially not what the specification mandates.

Some illustrative experiments follow, but the key question is: what should sed do when a range begins on a deleted line?

Experiments and examples

A simplified demonstration of the issue is this, which prints extra copies of lines rather than deleting them:

printf 'a\nb\n' | sed -e '1d;1,2p'

This provides sed with two lines of input, a and b. The program does two things:

1. Deletes the first line with 1d. According to the specification, the d command will "Delete the pattern space and start the next cycle."
2. Selects the range of lines from 1 to 2 and explicitly prints them out, in addition to the automatic printing every line receives. A line included in the range should thus appear twice.

My expectation was that this should print

b

only, with the range not applying because 1,2 is never reached during line 1 (because d jumped to the next cycle/line already) and so range inclusion never begins, while a has been deleted. The conformant Unix seds of macOS and Solaris 10 produce this output, as does the non-POSIX sed in Solaris and BSD sed in general.

GNU sed, on the other hand, prints

b
b

indicating that it has interpreted the range. This occurs both in POSIX mode and not. Busybox's sed has the same behaviour (but not identical behaviour always, so it doesn't seem to be a result of shared code).
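
To see which of the two behaviours a given sed has, the demonstration above can be wrapped in a small check. This is only a sketch: the function name classify_sed_range is made up for illustration, and it distinguishes only the two outputs described here.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): report how the sed found
# in $PATH handles a range whose first line was already deleted.
classify_sed_range() {
    nl='
'
    out=$(printf 'a\nb\n' | sed -e '1d;1,2p')
    case $out in
        b)          echo 'range never started (macOS/Solaris-style)' ;;
        "b${nl}b")  echo 'range applied to line 2 (GNU/Busybox-style)' ;;
        *)          echo 'unexpected output' ;;
    esac
}

classify_sed_range
```

Which message appears depends entirely on the sed installed, so no single expected output is given.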

Further experimentation with

printf 'a\nb\nc\nd\ne\n' | sed -e '2d;2,/c/p'
printf 'a\nb\nc\nd\ne\n' | sed -e '2d;2,/d/p'

finds that GNU sed appears to treat a range starting at a deleted line as though it starts on the following line. This is visible because /c/ does not match to end the range. Using /b/ to start the range does not behave the same as 2.

The initial working example I was using was

printf '%s\n' a b c d e | sed -e '1{/a/d;};1,//d'

as a way to delete all lines up to the first /a/ match, even if that is on the first line (what GNU sed would use 0,/a/d for — this was an attempted POSIX-compatible rendition of that).

It has been suggested that this should instead delete up to the second match of /a/ if the first line matches (or the whole file if there's no second match), which seems plausible – but again, only GNU sed does that. Both macOS sed and Solaris's sed produce

b
c
d
e

for that, as I expected (GNU sed produces the empty output from removing the unterminated range; Busybox sed prints just d and e, which is clearly wrong no matter what). Generally I'd assume that their having passed the certification conformance tests means that their behaviour is correct, but enough people have suggested otherwise that I'm not sure, the specification text isn't completely convincing, and the test suite can't be perfectly comprehensive.

Clearly it isn't practically portable to write that code today given the inconsistency, but theoretically it should be equivalent everywhere with one meaning or the other. I think this is a bug, but I don't know against which implementation(s) to report it. My view currently is that GNU and Busybox sed's behaviour is inconsistent with the specification, but I could be mistaken on that.
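
In the meantime, one way to sidestep the ambiguity for the original task (dropping everything up to and including the first /a/ match, even when that is on line 1) is to avoid address ranges altogether. The following is a sketch using awk rather than sed; the function name is an assumption for illustration, and the pattern is treated as an awk ERE rather than a sed BRE.

```shell
#!/bin/sh
# Hypothetical helper: delete every line up to and including the first
# line matching the ERE given as $1, then print the rest unchanged.
# Note: awk -v interprets backslash escapes in the value.
delete_through_first_match() {
    awk -v re="$1" '
        found { print }                    # after the first match: pass lines through
        !found && $0 ~ re { found = 1 }    # first match: drop it, print from the next line on
    '
}

printf '%s\n' a b c d e | delete_through_first_match a
```

For the a to e input this prints b, c, d and e, which is the result the original sed program was meant to give. If nothing matches, the whole input is dropped.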

What does POSIX require here?

Best Answer

That was raised on the Austin Group mailing list in March 2012. Here's the final message on that, by Geoff Clare of the Austin Group (the body that maintains POSIX), who is also the one who raised the issue in the first place. It is copied here from the gmane NNTP interface:

Date: Fri, 16 Mar 2012 17:09:42 +0000
From: Geoff Clare <gwc-7882/jkIBncuagvECLh61g@public.gmane.org>
To: austin-group-l-7882/jkIBncuagvECLh61g@public.gmane.org
Newsgroups: gmane.comp.standards.posix.austin.general
Subject: Re: Strange addressing issue in sed

Stephane Chazelas <stephane_chazelas-Qt13gs6zZMY@public.gmane.org> wrote, on 16 Mar 2012:
>
> 2012-03-16 15:44:35 +0000, Geoff Clare:
> > I've been alerted to an odd behaviour of sed on certified UNIX
> > systems that doesn't seem to match the requirements of the
> > standard.  It concerns an interaction between the 'n' command
> > and address matching.
> > 
> > According to the standard, this command:
> > 
> > printf 'A\nB\nC\nD\n' | sed '1,3s/A/B/;1,3n;1,3s/B/C/'
> > 
> > should produce the output:
> > 
> > B
> > C
> > C
> > D
> > 
> > GNU sed does produce this, but certified UNIX systems produce this:
> > 
> > B
> > B
> > C
> > D
> > 
> > However, if I change the 1,3s/B/C/ to 2,3s/B/C/ then they produce
> > the expected output (tested on Solaris and HP-UX).
> > 
> > Is this just an obscure bug from common ancestor code, or is there
> > some legitimate reason why this address change alters the behaviour?
> [...]
> 
> I suppose the idea is that for the second 1,3cmd, line "1" has
> not been seen, so the 1,3 range is not entered.

Ah yes, now it makes sense, and it looks like the standard does
require this slightly strange behaviour, given how the processing
of the "two addresses" case is specified:

    An editing command with two addresses shall select the inclusive
    range from the first pattern space that matches the first address
    through the next pattern space that matches the second.  (If the
    second address is a number less than or equal to the line number
    first selected, only one line shall be selected.) Starting at the
    first line following the selected range, sed shall look again for
    the first address. Thereafter, the process shall be repeated.

It's specified this way because the addresses can be BREs, but if
the same matching process is applied to the line numbers (even though
they can only match at most once), then the 1,3 range on that last
command is never entered.

-- 
Geoff Clare <g.clare-7882/jkIBncuagvECLh61g@public.gmane.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England

And here's the relevant part of the rest of the message (by me) that Geoff was quoting:

I suppose the idea is that for the second 1,3cmd, line "1" has
not been seen, so the 1,3 range is not entered.

Same idea as in

printf '%s\n' A B C | sed -n '1d;1,2p'

whose behavior differ in traditional (heirloom toolchest at
least) and GNU.

It's unclear to me whether POSIX wants one behavior or the
other.

So, (according to Geoff) POSIX is clear that the GNU behaviour is non-compliant.

And it's true it's less consistent (compare seq 10 | sed -n '1d;1,2p' with seq 10 | sed -n '1d;/^1$/,2p') even if potentially less surprising to people who don't realise how ranges are processed (even Geoff initially found the conforming behaviour "strange").

Nobody bothered reporting it as a bug to the GNU folks. I'm not sure I'd qualify it as a bug. Probably the best option would be for the POSIX specification to be updated to allow both behaviours to make it clear that one cannot rely on either.

Edit. I've now had a look at the original sed implementation in Unix V7 from the late 70s, and it looks pretty much like that behaviour for numeric addresses was not intended or at least not thought through completely there.

With Geoff's reading of the spec (and my original interpretation of why it happens), conversely, in:

seq 5 | sed -n '3d;1,3p'

lines 1, 2, 4 and 5 should be output, because this time, it's the end address that is never encountered by the 1,3p ranged command, like in seq 5 | sed -n '3d;/1/,/3/p'

Yet, that doesn't happen in the original implementation, nor any other implementation I tried (busybox sed returns lines 1, 2 and 4 which looks more like a bug).

If you look at the UNIX V7 code, it does check for the case where the current line number is greater than the (numerical) end address, and gets out of the range then. The fact that it doesn't do it for the start address looks more like an oversight than an intentional design.

What that means is that there's no implementation that is actually compliant to that interpretation of the POSIX spec in that regard at the moment.
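
To make the difference concrete, here is a rough awk model of sed -n 'Dd;S,Ep' with purely numeric addresses. It is a simplification written for this discussion, not code from any sed: the function name model_range and the mode names are made up. The "traditional" mode follows the V7-like behaviour described above, and the "lenient" mode selects by line number alone, which is how the GNU results above look for all-numeric ranges. It assumes S <= E in lenient mode.

```shell
#!/bin/sh
# Simplified model (an assumption, not real sed source) of: sed -n 'Dd;S,Ep'
# usage: model_range MODE D S E     (input lines on stdin)
#   MODE=traditional: the range opens only if the range command sees line S
#   MODE=lenient:     any line the command sees with S <= NR <= E is selected
model_range() {
    awk -v mode="$1" -v d="$2" -v s="$3" -v e="$4" '
        NR == d { next }                     # Dd: delete, start next cycle
        mode == "lenient" {                  # select by line number alone
            if (NR >= s && NR <= e) print
            next
        }
        active && NR > e { active = 0 }      # already past the end address
        !active && NR == s {                 # range opens only on line S itself
            print
            if (NR < e) active = 1           # E <= S selects just this one line
            next
        }
        active { print; if (NR >= e) active = 0 }
    '
}

printf '%s\n' A B C | model_range traditional 1 1 2      # prints nothing
printf '%s\n' A B C | model_range lenient 1 1 2          # prints B
printf '%s\n' 1 2 3 4 5 | model_range traditional 3 1 3  # prints 1 and 2
```

Neither mode prints 1, 2, 4 and 5 for the last case, which is what the strict reading of the specification would call for.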

Another confusing behaviour with the GNU implementation is:

$ seq 5 | sed -n '2d;2,/3/p'
3
4
5

Since line 2 was skipped, the 2,/3/ is entered upon line 3 (the first line whose number is >= 2). But as it's the line that made us enter the range, it's not checked for the end address. It gets worse with busybox sed in:

$ seq 10 | busybox sed -n '2,7d; 2,3p'
8

Since lines 2 to 7 were deleted, line 8 is the first one that is >= 2 so the 2,3 range is entered then!

Related Solutions

The point of using multiple exclamation marks in sed

sed's API is primitive - and this is by design. At least, it has remained primitive by design - whether it was designed primitively at inception I cannot say. In most cases the writing of a sed script which, when run, will output another sed script is a simple matter indeed. sed is very often applied in this way by macro preprocessors such as m4 and/or make.

(What follows is a highly hypothetical use case: it is a problem engineered to suit a solution. If it feels like a stretch to you, then that is probably because it is, but that doesn't necessarily make it any less valid.)

Consider the following input file:

cat <<"" >./infile
camel
cat dog camel
dog cat
switch
upper
lower

If we wanted to write a sed script which would append the word -case to the tail of each appropriate word in the above input file only if it could be found on a line in appropriate context, and we desired to do so as efficiently as possible (as should be our goal, for example, during a compile operation) then we should prefer to avoid applying /regexp/s as much as possible.

One thing we might do is pre-edit the file on our system right now, and never call sed at all during compilation. But if any of those words in the file should or should not be included based on local settings and/or compile-time options, then doing so would likely not be a desirable alternative.

Another thing we might do is process the file now against regexps. We can produce - and include in our compilation - a sed script which can apply edits according to line number - which is typically a far more efficient route in the long-run.

For example:

n=$(printf '\\\n\t')
grep -En 'camel|upper|lower' <infile |
sed "   1i${n%?}#!/usr/heirloom/bin/posix2001/sed -nf
        s/[^:]*/:&$n&!n;&!b&$n&/;s/://2;\$a${n%?}q"'
        s/ *cat/!/g;s/ *dog/!/g
        s| *\([cul][^ ]*\).*|s/.*/\1-case/p|'

...which writes output in the form of a sed script and which looks like...

#!/usr/heirloom/bin/posix2001/sed -nf
:1
    1!n;1!b1
    1s/.*/camel-case/p
:2
    2!n;2!b2
    2!!s/.*/camel-case/p
:5
    5!n;5!b5
    5s/.*/upper-case/p
:6
    6!n;6!b6
    6s/.*/lower-case/p
q

When that output is saved to an executable text file on my machine named ./bang.sed and run like ./bang.sed ./infile, the output is:

camel-case
upper-case
lower-case

Now you might ask me... Why would I want to do that? Why would I not just anchor grep's matches? Who uses camel-case anyway? And to each question I could only reply, I have no idea... because I don't. Before reading this question I had never personally noticed the multi-! parsing requirement in the spec - I think it's a pretty neat catch.

The multi-! thing did immediately make sense to me, though - much of the sed specification is geared toward simply parsed and simply generated sed scripts. You'll probably find the required \newline delimiters for [wr:bt{] make a lot more sense in that context, and if you keep that idea in mind you might make better sense of some other aspects of the spec - (such as : accepting no addresses, and q refusing to accept any more than 1).

In the example above I write out a certain form of sed script which can only ever be read once. If you look hard at it you might notice that as sed reads the edit-file it progresses from one command-block to the next - it never branches away from or completes its edit-script until it is completely through with its edit-file.

I consider that multi-! addresses might be more useful in that context than in some others, but, in honesty, I can't think of a single case in which I might have put it to very good use - and I sed a lot. I also think it noteworthy that GNU/BSD seds both fail to handle it as specified - this is probably not an aspect of the spec which is in much demand, and so if an implementation overlooks it I doubt very seriously their bugs@ box will suffer terribly as a result.

That said, failure to handle this as specified is a bug for any implementation which pretends to compliance, and so I think shooting an email to the relevant dev boxes is called-for here, and I intend to do so if you don't.

Why GNU find -execdir command behave differently than BSD find

It's not an endless loop, it's just GNU find reporting that echo died of a SIGPIPE (because the other end of the pipe on stdout was closed when head died).

-execdir is not specified by POSIX. And even for -exec, there's nothing in the POSIX spec that says that if the command is killed by a SIGPIPE, find should exit.

So, if POSIX did specify -execdir, gfind would probably be more POSIX conformant than your BSD find (assuming your BSD find exits upon its child dying of a SIGPIPE, as the wording of your question suggests; FreeBSD find doesn't in my tests and does run echo in a loop for every file, like GNU find and not endlessly).

You may say that for most common cases, find exiting upon a child dying of SIGPIPE would be preferable, but the -executed command could still die of a SIGPIPE for other reasons than the pipe on stdout being closed, so exiting find for that would be borderline acceptable.

With GNU find, you can tell find to quit if a command fails with:

find . ... \( -exec echo {} \; -o -quit \)

As to whether a find implementation is allowed or forbidden to report children dying of a signal on stderr, here (with the usage of -execdir) we're outside the scope of POSIX anyway, but if -exec was used in place of -execdir, it seems that would be a case where gfind is not conformant.

The spec for find says: "the standard error shall be used only for diagnostic messages" but also says there:

Default Behavior: When this section is listed as "The standard error shall be used only for diagnostic messages.", it means that, unless otherwise stated, the diagnostic messages shall be sent to the standard error only when the exit status indicates that an error occurred and the utility is used as described by this volume of POSIX.1-2008.

Which would indicate that since find doesn't return with a non-zero exit status in that case, it should not output that message on stderr.

Note that by that text, both GNU and FreeBSD find would be non-compliant in a case like:

$ find /dev/null -exec blah \;; echo "$?"
find: `blah': No such file or directory
0

where both report an error without setting the exit status to non-zero. Which is why I raised the question on the austin-group (the guys behind POSIX) mailing list.

Note that if you change your command to:

(trap '' PIPE; find -L /etc -execdir echo {} \; | head)

echo will still be run for every file, will still fail, but this time, it will be echo reporting the error message.

Now about filename vs /etc/filename vs ./filename being displayed.

Again, -execdir being not a standard option, there's no text that says who's right and who's wrong. -execdir was introduced by BSD find and copied later by GNU find.

GNU find has done some intentional changes (improvements) over it. For instance, it prepends file names with ./ in the arguments passed to commands. That means that find . -execdir cmd {} \; doesn't have a problem with filenames starting with - for instance.

The fact that -L -execdir doesn't pass a filepath relative to the parent directory is actually a bug that affects version 4.3.0 to 4.5.8 of GNU find. It was fixed in 4.5.9, but that's on the development branch and there hasn't been a new stable release since (as of 2015-12-22, though one is imminent).

More info at the findutils mailing list.

If all you want is print the base name of every file in /etc portably, you can just do:

find -L /etc -exec basename {} \;

Or more efficiently:

find -L ///etc | awk -F / '/\/\// && NR>1 {print last}
                          {if (NF > 1) last = $NF
                           else last = last "\n" $NF}
                          END {if (NR) print last}'

which you can simplify to

find -L /etc | awk -F / '{print $NF}'

if you can guarantee file paths don't contain newline characters (IIRC, some versions of OS X had such files in /etc though).

GNUly:

find -L /etc -printf '%f\n'

As to whether:

find -exec echo {} \;

in the link you're referring to, is POSIX or not.

No, as a command invocation, that is not POSIX. A script containing it would be non-compliant.

POSIX find requires that at least one path be given, but leaves the behaviour unspecified if the first non-option argument of find starts with - or is a find predicate (like ! or "("). So GNU find's behaviour is compliant, and so are implementations that report an error (or treat the first argument as a file path even if it represents a find predicate) or spray red paint at your face. There's no reason POSIXLY_CORRECT would affect the find behaviour there.
