How to search for the word stored in the hold space with sed

perl, regular expression, sed, text processing

This is a sed-specific question; I am well aware it could be done with other tools but I am working on expanding my knowledge of sed.

How can I use sed to globally quote (actually backtick) a word that is not specified in the script? The word is held in the hold space.

What I want is something like:

s/word/`&`/g

But the trick is, word will be contained not in the sed script but in the hold space. So it looks something more like:

H
g
s/^\(.*\)\n\(.*\)\1\(.*\)$/\2`\1`\3/

which will quote one occurrence of the word held in the hold space. I want to quote all of them, but I can't just add a g flag, because of the way this uses backreferences rather than a static regex.

H
g
s/^\(.*\)\n\(.*\)\1\(.*\)\1\(.*\)$/\2`\1`\3`\1`\4/

This handles two occurrences of the word, but it fails when there is only one and ignores any beyond the second.
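To make the single-occurrence behaviour easy to reproduce, here is a minimal sketch. It assumes GNU sed and a hypothetical two-line input in which line 1 is the word and line 2 is the text; `quote_once` is an illustrative name, not part of the original script.

```shell
# Hypothetical helper: stdin line 1 is the word, line 2 is the text.
quote_once() {
  sed -n '
    # Line 1: store the word in the hold space.
    1h
    2{
      # Hold space becomes "word<NL>text"; copy it back to the pattern space.
      H;g
      # The greedy second group runs up to the LAST occurrence of the word.
      s/^\(.*\)\n\(.*\)\1\(.*\)$/\2`\1`\3/
      p
    }
  '
}

printf 'word\nline with a word and words.\n' | quote_once
# prints: line with a word and `word`s.
```

Only the last occurrence is quoted, because the second `\(.*\)` is greedy and takes as much of the line as it can before the back-reference.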

I thought I could use something clean and simple like:

s//`&`/g

But that reuses the last used regex, not what it matches. (Which makes sense.)

Is there any way in sed to do what I am trying to do? (Actually I would be interested in seeing how easy this would be in perl, but I would still like to see how to do it in sed.)

UPDATE

Not that it's needed for this question, but I thought I would give a little more context on what exactly I was doing when I came up with this question:

I had a big text file of documentation, certain parts of which needed to be condensed and summarized into an asciidoc table. It was pretty easy because of the Description: and Prototype: lines, etc., so I actually wrote a quick sed script to do all the parsing for me. It worked beautifully—but the one thing it was missing was that I wanted to backtick the words in the Description line that matched the arguments listed in the Prototype line. The prototype lines looked something like this:

Prototype: some_words_here(and, arg, list,here)

There were upwards of 200 different entries in the table I was outputting (and the source documentation included a lot more text than that) and each arglist only needed to be used to backtick-quote matching words on a single line. To make things trickier, some of the args were not in the Description line, some were in more than once, and some arglists were empty().

However, given that sometimes an arg would match a part of a word, which I didn't want to get backticked, and sometimes an arg name was a common word (like from) which I only wanted to get backticked when it was used in the context of explaining the use of the function, an automated solution wasn't actually a good fit at all and I instead used vim to do the job semi-manually, with the help of some tricky macros. 🙂

Best Answer

That was a hard one. Assuming you have a file like this:

$ cat file
word
line with a word and words and wording wordy words.

Where:

Line 1 is the search pattern, which should be kept in the hold space and quoted as `word`.
Line 2 is the line to search and replace globally.

The sed command:

sed -n '1h; 2{x;G;:l;s/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/;tl;p}' file

Explanation:

1h; save the first line to the hold space (this is what we want to search for).
- hold space contains: word
2{...} applies to the second line.
x; exchange the pattern space and the hold space.
G; append the hold space to the pattern space. In the pattern space we have now:

word # I will call this line the "pattern line" from now on
line with a word and words and wording wordy words.

:l; set a label called l as a jump target for later.
s/// do the actual search/replace in the pattern space mentioned above:
- ^\([^\n]\+\)\n matches the "pattern line": starting at the beginning of the pattern space (^), one or more (\+) characters that are not a newline ([^\n]), up to the newline \n. The match is stored in the back-reference \1.
- \(.*[^`]\) matches any characters (.*) followed by one character that is not a backtick ([^`]). This is stored in \2. Because .* is greedy, \2 now contains "line with a word and words and wording wordy ", that is, everything up to the last occurrence of word, because...
- \1 is the next search term: the back-reference to what the "pattern line" contains, i.e. word.
- \([^`]\) must then match one more character that is not a backtick; it is saved in \3. Without this (and the [^`] at the end of \2), we would end up in an endless loop quoting the same word again and again (````word````), because s/// would always succeed and tl; would jump back to :l (see tl; further down).
- \1\n\2`\1`\3 is the replacement, built from the back-references. The second \1, wrapped in backticks, is the occurrence being quoted (the first \1 restores the "pattern line").
tl; if the s/// was successful (we replaced something), jump to the label called l and start again, until there is nothing more to search and replace. This is the case when all occurrences of word are replaced/quoted.
p; when all is done, print the altered line (pattern space).

The output:

$ sed -n '1h; 2{x;G;:l;s/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/;tl;p}' file
word
line with a `word` and `word`s and `word`ing `word`y `word`s.
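One limitation of the loop above, as far as I can tell: it needs a non-backtick character on each side of the word, so an occurrence at the very start or very end of the line is left unquoted. The following is a sketch of a variant, not the answer's original method. It assumes GNU sed and the same two-line input, and uses a second newline as a cursor instead of the backtick guards; `quote_all` is a hypothetical name.

```shell
# Sketch only: assumes GNU sed (\+ and \n inside brackets are GNU extensions).
# quote_all is a hypothetical helper: stdin line 1 is the word, line 2 the text.
quote_all() {
  sed -n '
    1h
    2{
      # Build "word<NL>text<NL>": the second newline is a cursor, and
      # everything to its right has already been processed.
      x;G
      s/$/\n/
      :l
      # Quote the last occurrence left of the cursor, then move the cursor
      # to just before it.
      s/^\([^\n]\+\)\n\(.*\)\1\([^\n]*\)\n\(.*\)$/\1\n\2\n`\1`\3\4/
      tl
      # Drop the word line and the cursor.
      s/^[^\n]*\n//
      s/\n//
      p
    }
  '
}

printf 'word\nword at start and word\n' | quote_all
# prints: `word` at start and `word`
```

Unlike the original command, this version prints only the rewritten line and drops the word line. Because the word is matched through a back-reference rather than spliced into the regex, any regex metacharacters in it are treated literally.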

Related Solutions

Extracting a regex matched with ‘sed’ without printing the surrounding characters

When a regexp contains groups, there may be more than one way to match a string against it: regexps with groups are ambiguous. For example, consider the regexp ^.*\([0-9][0-9]*\)$ and the string a12. There are two possibilities:

Match a against .* and 2 against [0-9]*; 1 is matched by [0-9].
Match a1 against .* and the empty string against [0-9]*; 2 is matched by [0-9].

Sed, like all other regexp tools out there, applies the earliest longest match rule: it first tries to match the first variable-length portion against a string that's as long as possible. If it finds a way to match the rest of the string against the rest of the regexp, fine. Otherwise, sed tries the next longest match for the first variable-length portion and tries again.

Here, the match with the longest string first is a1 against .*, so the group only matches 2. If you want the group to start earlier, some regexp engines let you make the .* less greedy, but sed doesn't have such a feature. So you need to remove the ambiguity with some additional anchor. Specify that the leading .* cannot end with a digit, so that the first digit of the group is the first possible match.
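The greedy behaviour described above can be checked directly. This is a minimal sketch using the example string a12 from the text:

```shell
# The leading .* grabs "a1", so the group is left with only the last digit.
echo 'a12' | sed -n 's/^.*\([0-9][0-9]*\)$/\1/p'
# prints: 2
```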

If the group of digits cannot be at the beginning of the line:
```
sed -n 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p'
```
If the group of digits can be at the beginning of the line, and your sed supports the \? operator for optional parts:
```
sed -n 's/^\(.*[^0-9]\)\?\([0-9][0-9]*\).*/\2/p'
```
If the group of digits can be at the beginning of the line, sticking to standard regexp constructs:
```
sed -n -e 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p' -e t -e 's/^\([0-9][0-9]*\).*/\1/p'
```
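A quick way to see the difference between the anchored variants is to feed them one line with digits after a letter and one line that starts with digits. The two sample lines are hypothetical test input:

```shell
# Variant that requires a non-digit before the group.
printf 'a12\n12\n' | sed -n 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p'
# prints: 12   (only for "a12"; the bare "12" line does not match)

# Standard-constructs variant: falls back to a second s when the first fails.
printf 'a12\n12\n' | sed -n -e 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p' -e t -e 's/^\([0-9][0-9]*\).*/\1/p'
# prints: 12 twice, once per input line
```

The `t` command skips the second substitution whenever the first one has already succeeded, so each line is printed at most once.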

By the way, it's that same earliest longest match rule that makes [0-9]* match the digits after the first one, rather than the subsequent .*.

Note that if there are multiple sequences of digits on a line, your program will always extract the last sequence of digits, again because of the earliest longest match rule applied to the initial .*. If you want to extract the first sequence of digits, you need to specify that what comes before is a sequence of non-digits.

sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'
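For illustration, here are the last-sequence and first-sequence commands side by side on a hypothetical input with two runs of digits:

```shell
# Greedy leading .* : the group ends up on the last run of digits.
echo 'a12b34' | sed -n 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p'
# prints: 34

# Leading [^0-9]* cannot skip over digits: the group is the first run.
echo 'a12b34' | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'
# prints: 12
```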

More generally, to extract the first match of a regexp, you need to compute the negation of that regexp. While this is always theoretically possible, the size of the negation grows exponentially with the size of the regexp you're negating, so this is often impractical.

Consider your other example:

sed -n 's/.*\(CONFIG_[a-zA-Z0-9_]*\).*/\1/p'

This example actually exhibits the same issue, but you don't see it on typical inputs. If you feed it hello CONFIG_FOO_CONFIG_BAR, then the command above prints out CONFIG_BAR, not CONFIG_FOO_CONFIG_BAR.

There's a way to print the first match with sed, but it's a little tricky:

sed -n -e 's/\(CONFIG_[a-zA-Z0-9_]*\).*/\n\1/' -e T -e 's/^.*\n//' -e p

(This assumes your sed supports \n to mean a newline in the s replacement text.) This works because sed looks for the earliest match of the regexp, and we don't try to match what precedes the CONFIG_… bit. Since there is no newline inside the line, we can use it as a temporary marker. The T command (a GNU extension) says to give up if the preceding s command didn't match.
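Both CONFIG_ commands can be compared on the same input. This sketch assumes GNU sed (for `T` and for `\n` in the replacement); the trailing " bye" is added only to show that text after the match is dropped as well:

```shell
line='hello CONFIG_FOO_CONFIG_BAR bye'

# Greedy leading .* : only the last CONFIG_ name survives.
echo "$line" | sed -n 's/.*\(CONFIG_[a-zA-Z0-9_]*\).*/\1/p'
# prints: CONFIG_BAR

# Newline-marker trick: keeps the earliest (and longest) match.
echo "$line" | sed -n -e 's/\(CONFIG_[a-zA-Z0-9_]*\).*/\n\1/' -e T -e 's/^.*\n//' -e p
# prints: CONFIG_FOO_CONFIG_BAR
```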

When you can't figure out how to do something in sed, turn to awk. The following command prints the earliest longest match of a regexp:

awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH)}'

And if you feel like keeping it simple, use Perl.

perl -l -ne '/[0-9]+/ && print $&'       # first match
perl -l -ne '/([0-9]+)[^0-9]*$/ && print $1'  # last match

Substituting the first occurrence of a pattern in a line, for all the lines in a file with sed

You're overthinking it. By default (without the g flag), sed replaces only the first instance on each line. You still want to anchor the pattern, though, because what you're after is not so much the first instance in the line as the one at the start of the line. And you usually don't need the explicit line actions you're trying to use (why?).

sed 's/^" /"/'
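As a small check of why the anchor matters, here is the command run on two hypothetical lines, one that starts with a quote followed by a space and one where that sequence only appears mid-line:

```shell
printf '" foo " bar\nbaz " qux\n' | sed 's/^" /"/'
# prints:
#   "foo " bar
#   baz " qux
```

Only the quote-space pair at the start of the first line is changed; the second line is left alone because of the ^ anchor.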
