I'm using a long gawk 3.1.6 script to do a complex conversion of Zim markdown text into GtkDialog code and am stuck on the following problem…
Sample ASCII input…
[[link|label label]] [[link]] @tag more text
Commandline test to find right regex…
re="[][][][]"; echo '[[link|label label]] [[link]] @tag more text' | awk -v RE=$re '{split($0,A,RE); printf "\n(" A[1] ")(" A[2] ")(" A[3] ")(" A[4] ")(" A[5] ")(" A[6] ")(" A[7] ")(" A[8] ")\n"}'
The regex "[][][][]" splits out the two hyperlink forms quite nicely, so that's not a problem. It is easier to understand if you divide it in two: "[][]" and "[][]". Each half is a bracket expression that matches a single "[" or "]", so together they match the "[[" or "]]" we want to split on. The characters inside each class have to be written in reverse order ("]" first) to comply with the class meta-character restrictions.
The problem is in also splitting out the "@tag" into just "tag". "tag" could be any alphanumeric text either followed by a space or the end of the string.
Executing the commandline test above yields…
()(link|label label)( )(link)( @tag more text)()()
But I need it to yield…
()(link|label label)( )(link)( )(tag)(more text)
I've tried numerous regex strings like "[][][][]|@[[:alnum:]]*" which drops the entire word and yields…
()(link|label label)( )(link)( )( more text)()
and "[][][][]|@" which yields…
()(link|label label)( )(link)( )(tag more text)()
Any ideas?
Best Answer
I don't think you can do this in a single regex, but since you're using gawk, you can use some gawk extensions: the split() function and the match() function.
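The answer's own code is not shown above, so what follows is only a sketch of one possible two-step approach, not necessarily what the answerer had in mind. It splits on the bracket regex first, then uses match() on each piece to peel off any "@tag". It relies on the plain two-argument match() with RSTART and RLENGTH rather than a gawk-only feature, so it should also work on gawk 3.1.6. The function name split_zim is made up for illustration, and the rule that a single space after the tag gets swallowed is an assumption taken from the desired output in the question.

```shell
# split_zim: hypothetical helper, reads Zim-style lines on stdin and
# prints every field wrapped in parentheses.
split_zim() {
    awk -v RE='[][][][]' '
    {
        # Step 1: split on "[[" / "]]" exactly as in the question.
        n = split($0, A, RE)
        out = ""
        for (i = 1; i <= n; i++) {
            piece = A[i]
            # Step 2: peel off each "@tag" found in this piece.
            while (match(piece, /@[[:alnum:]]+/)) {
                # Text before the "@".
                out = out "(" substr(piece, 1, RSTART - 1) ")"
                # The tag itself, without the leading "@".
                out = out "(" substr(piece, RSTART + 1, RLENGTH - 1) ")"
                # Keep the remainder; drop the one space after the tag.
                piece = substr(piece, RSTART + RLENGTH)
                sub(/^ /, "", piece)
            }
            out = out "(" piece ")"
        }
        print out
    }'
}

echo '[[link|label label]] [[link]] @tag more text' | split_zim
# prints: ()(link|label label)( )(link)( )(tag)(more text)
```

Note that this treats any "@" followed by alphanumerics as a tag, so something like an email address in the text would be split as well. If that matters, the match() pattern would need tightening.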