Sed to match pattern between matching curly braces

escape-characters, osx, regular expression, sed

From a pattern such as

[string 1]{string 2}

I want to extract string 2, the string between the last pair of matching curly braces; that is, delete [string 1] and the opening { and closing }. My attempt below breaks when there are additional [, ] pairs in either string 1 or string 2.

Desired Output:

The desired output from the script below begins with foo and ends with a digit:

foo bar 1
foo bar 2
foo[3]{xyz} bar 3
foo $sq[3]{xyz}$ bar 4
foo $sq[3]{xyz}$ bar 5
foo $sq[3]{xyz}$ bar 6
foo $sq[3]{xyz}$ bar 7
foo $sq[3]{xyz}$ bar 8
foo $sq[abc]{xyz}$ bar 9
foo $sq[abc]{xyz}$ bar 10

Assumptions:

Parameter to RemoveInitialSquareBraces always begins with a [ and ends with a }.
The opening [ for string 1 will have a matching ] at the point where the opening { begins for string 2.

Platform:

MacOS 10.9.5

Script

#!/bin/bash

function RemoveInitialSquareBraces {
    #EXTRACTED_TEXT="$(\
    #      echo "$1" \
    #    | sed 's/^\[.*\]//'              \
    #    | sed 's/{//'                    \
    #    | sed 's/}$//'                   \
    #    )"
    EXTRACTED_TEXT="$(\
          echo "$1" \
        | sed 's/.*[^0-9]\]{\(.*\)}/\1/' \
        )"
        
    echo "${EXTRACTED_TEXT}"
}

RemoveInitialSquareBraces '[]{foo bar 1}'
RemoveInitialSquareBraces '[abc]{foo bar 2}'
RemoveInitialSquareBraces '[]{foo[3]{xyz} bar 3}'
RemoveInitialSquareBraces '[]{foo $sq[3]{xyz}$ bar 4}'
RemoveInitialSquareBraces '[goo{w}]{foo $sq[3]{xyz}$ bar 5}'
RemoveInitialSquareBraces '[goo[3]{w}]{foo $sq[3]{xyz}$ bar 6}'
RemoveInitialSquareBraces '[goo[3]{w} hoo[3]{5}]{foo $sq[3]{xyz}$ bar 7}'
RemoveInitialSquareBraces '[goo[3]{w} hoo[3]{5}]{foo $sq[3]{xyz}$ bar 8}'
RemoveInitialSquareBraces '[goo[3]{w} hoo[xyz]{5}]{foo $sq[abc]{xyz}$ bar 9}'
RemoveInitialSquareBraces '[goo[3]{w} hoo[xyz]{uvw}]{foo $sq[abc]{xyz}$ bar 10}'

exit 0

Best Answer

For the input examples above, the script can be:

sed "s/[^\"']*[^0-9]\]{\(.*\)}/\1/" <<\END
"[]{foo bar 1}"
"[abc]{foo bar 2}"
"[]{foo[3]{xyz} bar 3}"
"[]{foo $sq[3]{xyz}$ bar 4}"
"[goo{w}]{foo $sq[3]{xyz}$ bar 5}"
"[goo[3]{w}]{foo $sq[3]{xyz}$ bar 6}"
"[goo[3]{w} hoo[3]{5}]{foo $sq[3]{xyz}$ bar 7}"
END

produces

"foo bar 1"
"foo bar 2"
"foo[3]{xyz} bar 3"
"foo $sq[3]{xyz}$ bar 4"
"foo $sq[3]{xyz}$ bar 5"
"foo $sq[3]{xyz}$ bar 6"
"foo $sq[3]{xyz}$ bar 7"

Another point: your function can be simplified:

function RemoveInitialSquareBraces {
    printf '%s\n' "$@" |
    sed ...
}

This way it accepts any number of arguments.
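
Filled in with the pattern from the question, the simplified function might look like the sketch below. It is only an illustration: the pattern relies on every inner `]{` being preceded by a digit, which holds for the first eight sample inputs but not for inputs 9 and 10.

```bash
#!/bin/bash

# Sketch: print each argument with the leading [string 1] and the outer
# { } removed. Uses the sed pattern from the question, so it relies on
# inner "]{" pairs being preceded by a digit (true for inputs 1-8 only).
RemoveInitialSquareBraces() {
    printf '%s\n' "$@" |
    sed 's/.*[^0-9]\]{\(.*\)}/\1/'
}

# One call, several arguments, one result per line.
RemoveInitialSquareBraces \
    '[]{foo bar 1}' \
    '[abc]{foo bar 2}' \
    '[goo[3]{w}]{foo $sq[3]{xyz}$ bar 6}'
```

Because `printf '%s\n' "$@"` emits one argument per line, sed processes each argument independently.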

Update: for the more general case, you can do the task in two steps:

sed -e "
# remove square brackets nested inside square brackets
s/\[.*\[.*\][^[]*\]/[]/
# strip the leading square-bracket group and the outer curly braces
s/\[[^]]*\]{\(.*\)}/\1/
"

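As a rough check, the two substitutions can be wrapped in a function and run against a few of the sample inputs from the question. The name `strip_outer` is hypothetical, and the sketch has only been reasoned through for the ten inputs listed in the question, not for arbitrary nesting.

```bash
#!/bin/bash

# Sketch of the two-step approach; strip_outer is a hypothetical name.
# Step 1 collapses a [string 1] that itself contains [...] into [].
# Step 2 removes the remaining leading [...] and the outer { }.
strip_outer() {
    printf '%s\n' "$@" |
    sed -e 's/\[.*\[.*\][^[]*\]/[]/' \
        -e 's/\[[^]]*\]{\(.*\)}/\1/'
}

strip_outer \
    '[]{foo[3]{xyz} bar 3}' \
    '[goo{w}]{foo $sq[3]{xyz}$ bar 5}' \
    '[goo[3]{w} hoo[xyz]{uvw}]{foo $sq[abc]{xyz}$ bar 10}'
# prints:
#   foo[3]{xyz} bar 3
#   foo $sq[3]{xyz}$ bar 5
#   foo $sq[abc]{xyz}$ bar 10
```
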
Addition: you can use grep with Perl-compatible regular expressions (GNU grep's -P extension):

grep -Po '\[([^][]*\[\w+\][^][]*)*\]{\K.*(?=})'

or sed with the same regexp (\w and \+ are GNU extensions):

sed 's/\[\([^][]*\(\[\w\+\][^][]*\)*\)*\]{\(.*\)}/\3/'

Related Solutions

Sed Command – Extracting a Regex Match Without Printing Surrounding Characters

When a regexp contains groups, there may be more than one way to match a string against it: regexps with groups are ambiguous. For example, consider the regexp ^.*\([0-9][0-9]*\)$ and the string a12. There are two possibilities:

Match a against .* and 2 against [0-9]*; 1 is matched by [0-9].
Match a1 against .* and the empty string against [0-9]*; 2 is matched by [0-9].

Sed, like all other regexp tools out there, applies the earliest longest match rule: it first tries to match the first variable-length portion against a string that's as long as possible. If it finds a way to match the rest of the string against the rest of the regexp, fine. Otherwise, sed tries the next longest match for the first variable-length portion and tries again.

Here, the match with the longest string first is a1 against .*, so the group only matches 2. If you want the group to start earlier, some regexp engines let you make the .* less greedy, but sed doesn't have such a feature. So you need to remove the ambiguity with some additional anchor. Specify that the leading .* cannot end with a digit, so that the first digit of the group is the first possible match.
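
The difference can be seen directly on the string a12. The two function names below are hypothetical and exist only to keep the comparison readable.

```bash
#!/bin/bash

# Greedy .* takes "a1", leaving only "2" for the group.
greedy_group()   { sed 's/^.*\([0-9][0-9]*\)$/\1/'; }

# Requiring a non-digit before the group stops .* from swallowing
# the first digit, so the group starts at "1".
anchored_group() { sed 's/^.*[^0-9]\([0-9][0-9]*\)$/\1/'; }

echo 'a12' | greedy_group     # prints 2
echo 'a12' | anchored_group   # prints 12
```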

If the group of digits cannot be at the beginning of the line:
```
sed -n 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p'
```
If the group of digits can be at the beginning of the line, and your sed supports the \? operator for optional parts:
```
sed -n 's/^\(.*[^0-9]\)\?\([0-9][0-9]*\).*/\2/p'
```
If the group of digits can be at the beginning of the line, sticking to standard regexp constructs:
```
sed -n -e 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p' -e t -e 's/^\([0-9][0-9]*\).*/\1/p'
```

By the way, it's that same earliest longest match rule that makes [0-9]* match the digits after the first one, rather than the subsequent .*.

Note that if there are multiple sequences of digits on a line, your program will always extract the last sequence of digits, again because of the earliest longest match rule applied to the initial .*. If you want to extract the first sequence of digits, you need to specify that what comes before is a sequence of non-digits.

sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'
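
A small comparison on a line with two digit sequences illustrates the point. The function names are hypothetical; the patterns are the ones discussed above.

```bash
#!/bin/bash

# Leading .* is greedy, so the group lands on the last digit sequence.
last_number()  { sed -n 's/^.*[^0-9]\([0-9][0-9]*\).*/\1/p'; }

# Leading [^0-9]* cannot skip over digits, so the group is the first sequence.
first_number() { sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'; }

echo 'abc 12 def 34 ghi' | last_number    # prints 34
echo 'abc 12 def 34 ghi' | first_number   # prints 12
```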

More generally, to extract the first match of a regexp, you need to compute the negation of that regexp. While this is always theoretically possible, the size of the negation grows exponentially with the size of the regexp you're negating, so this is often impractical.

Consider your other example:

sed -n 's/.*\(CONFIG_[a-zA-Z0-9_]*\).*/\1/p'

This example actually exhibits the same issue, but you don't see it on typical inputs. If you feed it hello CONFIG_FOO_CONFIG_BAR, then the command above prints out CONFIG_BAR, not CONFIG_FOO_CONFIG_BAR.

There's a way to print the first match with sed, but it's a little tricky:

sed -n -e 's/\(CONFIG_[a-zA-Z0-9_]*\).*/\n\1/' -e T -e 's/^.*\n//' -e p

(Assuming your sed supports \n to mean a newline in the s replacement text.) This works because sed looks for the earliest match of the regexp, and we don't try to match what precedes the CONFIG_… bit. Since there is no newline inside the line, we can use it as a temporary marker. The T command says to give up if the preceding s command didn't match.

When you can't figure out how to do something in sed, turn to awk. The following command prints the earliest longest match of a regexp:

awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH)}'
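
The same awk idiom can be applied to the CONFIG_ example. The sketch below contrasts it with the greedy sed command; the function names are hypothetical.

```bash
#!/bin/bash

# sed with a greedy leading .*: the group is pushed as far right as possible.
last_config()  { sed -n 's/.*\(CONFIG_[a-zA-Z0-9_]*\).*/\1/p'; }

# awk's match() reports the leftmost, longest match via RSTART and RLENGTH.
first_config() {
    awk 'match($0, /CONFIG_[a-zA-Z0-9_]*/) {print substr($0, RSTART, RLENGTH)}'
}

echo 'hello CONFIG_FOO_CONFIG_BAR' | last_config    # prints CONFIG_BAR
echo 'hello CONFIG_FOO_CONFIG_BAR' | first_config   # prints CONFIG_FOO_CONFIG_BAR
```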

And if you feel like keeping it simple, use Perl.

perl -l -ne '/[0-9]+/ && print $&'       # first match
perl -l -ne '/([0-9]+)[^0-9]*$/ && print $1'  # last match

Sed – How to Match Curly Braces {} with Sed

Don't escape the { or }. Doing so would make sed think you are using a regular expression repetition operator (as in \{1,4\} to match the previous expression between one and four times). This is a basic regular expression operator, and the extended regular expression equivalent is written without the backslashes.

In an extended regular expression (as used with sed -E), you do want to escape both { and }. If you find it hard to remember when to escape and when to not escape these characters, you may always use [{] and [}] to match them literally in both basic and extended expressions.
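
The three spellings can be compared side by side. This is a minimal sketch with hypothetical function names; it assumes a sed that accepts -E, as mentioned above.

```bash
#!/bin/bash

# Basic regular expression: unescaped { and } are literal.
bre_braces()     { sed 's/{\([^}]*\)}/[\1]/'; }

# Extended regular expression: escape { and } to match them literally.
ere_braces()     { sed -E 's/\{([^}]*)\}/[\1]/'; }

# Bracket expressions [{] and [}] are literal in both flavours.
bracket_braces() { sed 's/[{]\([^}]*\)[}]/[\1]/'; }

echo 'a{b}c' | bre_braces       # prints a[b]c
echo 'a{b}c' | ere_braces       # prints a[b]c
echo 'a{b}c' | bracket_braces   # prints a[b]c
```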

You also use *. in two places where I think you mean .*. Incidentally, a * at the start of a regular expression (or just after ^ at the start) would match a literal * character.

As for the actual sed command, I would probably use the following:

sed 's/.*\\includegraphics.*{\([^}]*\)}.*/\1/' file.tex

To delete all lines that do not contain any \includegraphics command, you could add a simple d command:

sed -e '/\\includegraphics/!d' \
    -e 's/.*\\includegraphics.*{\([^}]*\)}.*/\1/' file.tex

This would work on your example, but not if the somethingelse at the end of the line contains a { character.
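
One possible way around that limitation, offered only as a sketch, is to replace the `.*` between \includegraphics and the brace with `[^{]*`, so that the first { after the command is used. This assumes that any optional [...] arguments contain no { character; the function name and the sample line are made up for illustration.

```bash
#!/bin/bash

# Sketch: print the argument of \includegraphics, taking the first {...}
# that follows the command rather than the last one on the line.
extract_graphics() {
    sed -e '/\\includegraphics/!d' \
        -e 's/.*\\includegraphics[^{]*{\([^}]*\)}.*/\1/'
}

printf '%s\n' \
    'Some text' \
    '\includegraphics[width=5cm]{plot.pdf} \caption{A {nested} caption}' |
    extract_graphics     # prints plot.pdf
```

With the original pattern, the same input line would yield nested, because the greedy `.*` runs to the last { that is followed by a closing }.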