Sed: insert something before subsequent lines that begin the same, but are not the same

Tags: sed, text-processing

I have a LaTeX file with one glossary entry per line:

...
\newglossaryentry{ajahn}{name=Ajahn,description={\textit{(Thai)} From the Pali \textit{achariya}, a Buddhist monk's preceptor: `teacher'; often used as a title of the senior monk or monks at monastery. In the West, the forest tradition uses it for all monks and nuns of more than ten years' seniority}}
\newglossaryentry{ajivaka}{name={\=Aj\={\i}vaka},description={Sect of contemplatives contemporary with the Buddha who held the view that beings have no volitional control over their actions and that the universe runs according to fate and destiny}}
...

Here we're only concerned about the \newglossaryentry{label} part of each line.

The lines of the file have been sorted with sort, so duplicate labels end up next to each other, like this:

\newglossaryentry{anapanasati}{name=\=an\=ap\=anasati,description={`Awareness of inhalation and exhalation'; using the breath, as a mediation object},sort=anapanasati}
\newglossaryentry{anapanasati}{name={\=an\=ap\=anasati},description={Mindfulness of breathing. A meditation practice in which one maintains one's attention and mindfulness on the sensations of breathing. \textbf{[MORE]}}}

How can I use sed on this file to insert a line before duplicate labels?

#!/bin/sh

cat glossary.tex | sed '
/\\newglossaryentry[{][^}]*[}]/{
    N;
    s/^\(\\newglossaryentry[{][^}]*[}]\)\(.*\)\n\1/% duplicate\n\1\2\n\1/;
}' > glossary.sed.tex

I made it as far as the command above, but it has a flaw: it reads lines into the pattern space in pairs, so it only works when both duplicates happen to fall in the same pair.

These will not match for example:

\newglossaryentry{abhinna}{name={abhi\~n\~n\=a},description={Intuitive powers that come from the practice of concentration: the ability to display psychic powers, clairvoyance, clairaudience, the ability to know the thoughts of others, recollection of past lifetimes, and the knowledge that does away with mental effluents (see \textit{asava}).}}
\newglossaryentry{acariya}{name={\=acariya},description={Teacher; mentor. See \textit{kalyanamitta.}}}
\newglossaryentry{acariya}{name=\=acariya,description={Teacher},see=Ajahn}
\newglossaryentry{adhitthana}{name={adhi\d{t}\d{t}h\=ana},description={Determination; resolution. One of the ten perfections \textit{(paramis).}}}

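This is because it first reads the abhinna and acariya lines as one pair, then the second acariya line and the adhitthana line as the next pair, so the two acariya lines are never compared with each other.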

I figure that this needs some extra sed magic with hold space and conditional printing of lines, but I couldn't get my head around it.

Best Answer

This is quite complicated for sed; it is more of a job for awk or perl. Here's a script that marks an entry whose label repeats that of the previous entry (lines that are not glossary entries are allowed in between):

perl -l -pe '
    if (/^ *\\newglossaryentry[* ]*{([^{}]*)}/) {
        print "% duplicate" if $1 eq $prev;
        $prev = $1;
    }'
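
The same logic can also be written in awk, the other tool mentioned above. This is only a minimal sketch: it assumes the label contains no braces, and the function name mark_consecutive is made up for illustration.

```sh
# Hypothetical helper (the name is illustrative): awk take on the Perl script above.
mark_consecutive() {
    # Split each line on braces: $1 is the text before the first "{", $2 is the label.
    awk -F'[{}]' '
        NF >= 3 && $1 ~ /^ *\\newglossaryentry[* ]*$/ {
            label = $2 ""          # appending "" forces a string comparison
            if (have_prev && label == prev) print "% duplicate"
            prev = label
            have_prev = 1
        }
        { print }
    ' "$@"
}
```

It can be used as a filter, for example `mark_consecutive glossary.tex > glossary.marked.tex`, where the output file name is just an example.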

It's easy enough to detect duplicates even in unsorted input, by remembering every label seen so far:

perl -l -pe '
    if (/^ *\\newglossaryentry[* ]*{([^{}]*)}/) {
        print "% duplicate" if $seen{$1};
        ++$seen{$1};
    }'
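
For comparison, the "remember every label" idea might look like this in awk, using an associative array in place of the Perl hash. Again a sketch under the assumption that labels contain no braces; mark_seen is a hypothetical name.

```sh
# Hypothetical helper (the name is illustrative): flags any label already seen earlier.
mark_seen() {
    # Split each line on braces: $1 is the text before the first "{", $2 is the label.
    awk -F'[{}]' '
        NF >= 3 && $1 ~ /^ *\\newglossaryentry[* ]*$/ {
            if ($2 in seen) print "% duplicate"
            seen[$2] = 1
        }
        { print }
    ' "$@"
}
```

Because every label is kept in memory, this works regardless of line order, at the cost of memory proportional to the number of distinct labels.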

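You can also easily restrict the check to entries on directly consecutive lines, by forgetting the previous label whenever a line that is not a glossary entry is seen: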

perl -l -pe '
    if (/^ *\\newglossaryentry[* ]*{([^{}]*)}/) {
        print "% duplicate" if $1 eq $prev;
        $prev = $1;
    } else {undef $prev}'
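
For completeness, the hold-space approach the question asked about can be sketched in sed as well. This is an untested-in-production sketch, not a recommendation over Perl: it assumes entries start at the beginning of the line, it behaves like the first Perl script (non-entry lines may sit between duplicates), and the function name mark_with_sed is made up for illustration.

```sh
# Hypothetical helper (the name is illustrative): keeps the previous
# "\newglossaryentry{label}" prefix in the hold space and compares against it.
mark_with_sed() {
    sed '
# Lines that are not glossary entries pass through untouched.
/^\\newglossaryentry{[^}]*}/!b
# Append the saved prefix (hold space) to the current line.
G
# Same prefix at both ends: emit the marker before this line.
/^\(\\newglossaryentry{[^}]*}\).*\n\1$/i\
% duplicate
# Drop the appended prefix, then save this line reduced to its prefix.
s/\n.*//
h
x
s/^\(\\newglossaryentry{[^}]*}\).*/\1/
x
' "$@"
}
```

Unlike the pairwise N approach from the question, every entry is compared with the one before it, so a duplicate is caught no matter where the pair boundary would have fallen.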