Duplicate, with a few small changes, a few lines in a text file

awk, text processing

I'm trying to work out how to replicate a single range of lines in a text file. The range starts with a line that is unique in the file but the range ends with a line that can exist in multiple places in the file.

Here's some example input I need to process:

I have no imagination
so this sample text will
Common
be boring. But it does
demonstrate the problem
I am trying to solve.
Common
Hi mom!
This is a unique line.
And here is some more
text that should be copied
as well.
Common
Followed by text that should
not be copied.

The lines I need to duplicate and modify run from "This is a unique line." through the next "Common" line.

The output I need is:

I have no imagination
so this sample text will
Common
be boring. But it does
demonstrate the problem
I am trying to solve.
Common
Hi mom!
This is a changed line.
And here is different more
text that should be copied
as well.
Common
This is a unique line.
And here is some more
text that should be copied
as well.
Common
Followed by text that should
not be copied.

The additional output is the modified copy, from "This is a changed line." through the "Common" line that follows it.

I need to get the range of lines starting with the line:

This is a unique line

and ending with the line:

Common

A copy of that range of lines must be inserted just before the original range of lines. The copy will need to be modified slightly.

The "Common" line that ends the range can itself occur in many places within the file.

I came up with a working awk script but it seems far more complicated than it needs to be. My awk skills are non-existent.

/This is a unique line/{flag=1}
/Common/{
    if (flag > 0) {
        n=m;
        sub("some","different",n);
        sub("unique","changed",n);
        print n "\n" $0 "\n" m;
        m=""
    };
    flag=0
};
flag{
    if (length(m) > 0) {
        m=m "\n" $0
    } else {
        m=$0
    }
}
!flag{ print }

Is there a cleaner, less verbose way to implement this? I'm open to other options besides awk. It just needs to be a standard command available on macOS.

Best Answer

awk '/This is a unique line/,/Common/{
   H = H RS $0
   if ( $0 ~ /Common/ ) {
      g = H
      sub("\n","",g)
      sub("some","different",g)
      sub("unique","changed",g)
      $0 = g H
   } else { next }
}1'   inputfile

Here's the sed code (which I showed in the Answer section) translated into awk.

Note that in your code you take on the responsibility of switching the awk variable flag on and off to keep track of the lines. awk already does exactly that for you under the hood when you use its range operator (the comma between two patterns).
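
The equivalence between the range operator and a hand-maintained flag can be checked on a shortened version of the question's sample. This is only a sketch: the variable names sample, range_out and flag_out are made up for illustration, and the flag version is a simplified stand-in for the script in the question, not a copy of it.

```shell
# Shortened sample from the question; note the "Common" line before the unique line.
sample='Common
Hi mom!
This is a unique line.
And here is some more
Common
Followed by text'

# Range pattern: awk switches the range on at the start pattern
# and off again at the first line matching the end pattern.
range_out=$(printf '%s\n' "$sample" | awk '/This is a unique line/,/Common/')

# Hand-rolled equivalent that maintains the flag explicitly.
flag_out=$(printf '%s\n' "$sample" |
  awk '/This is a unique line/{f=1} f{print} /Common/{f=0}')

printf '%s\n' "$range_out"
```

Both commands select the same three lines, from the unique line through the first "Common" after it. The earlier "Common" line is ignored because the range is not yet active when it is read.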

Related Solutions

Join Lines of Text with Repeated Beginning – Command Line Tips

This is a standard procedure for awk:

awk '
{
  k=$2
  for (i=3;i<=NF;i++)
    k=k " " $i
  if (! a[$1])
    a[$1]=k
  else
    a[$1]=a[$1] "<br>" k
}
END{
  for (i in a)
    print i "\t" a[i]
}' long.text.file
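
To see what the script above produces, here is a minimal run on made-up input. The sample words and the variable name joined are assumptions for illustration only. The result is piped through sort because awk does not guarantee any particular order for a for (i in a) loop.

```shell
# Hypothetical input: first field is the key, the rest is the text to join.
joined=$(printf '%s\n' 'apple red fruit' 'apple crunchy' 'banana yellow' |
awk '
{
  # Rebuild the text after the first field.
  k=$2
  for (i=3;i<=NF;i++)
    k=k " " $i
  # First occurrence of a key stores the text; later ones append with <br>.
  if (! a[$1])
    a[$1]=k
  else
    a[$1]=a[$1] "<br>" k
}
END{
  for (i in a)
    print i "\t" a[i]
}' | sort)

printf '%s\n' "$joined"
```

Each key appears once in the output, followed by a tab and all of its texts joined with <br>.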

If the file is sorted by the first word of each line, the script can be simpler:

awk '
{
  if($1==k)
    printf("%s","<br>")
  else {
    if(NR!=1)
      print ""
    printf("%s\t",$1)
  }
  for(i=2;i<NF;i++)
    printf("%s ",$i)
  printf("%s",$NF)
  k=$1
}
END{
print ""
}' long.text.file

Or just bash

unset n
while read -r word definition
do
    if [ "$last" = "$word" ]
    then
        printf "<br>%s" "$definition"
    else 
        if [ "$n" ]
        then
            echo
        else
            n=1
        fi
        printf "%s\t%s" "$word" "$definition"
        last="$word"
     fi
done < long.text.file
echo

Text file containing filenames and hashes – extracting lines with duplicate hashes

You could do worse than this two-pass awk solution:

awk 'NR == FNR{if ($2 in a) b[$2]++;a[$2]++; next}; $2 in b' file file

In the first pass, use array b to keep track of hash values that are encountered more than once. In the second pass, print a record if its hash exists within b.
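
A small run makes the two passes concrete. The file contents, the temporary file and the variable name dupes are hypothetical; the only assumption carried over from the answer is that the hash is in the second field.

```shell
# Hypothetical input: filename in field 1, hash in field 2.
tmp=$(mktemp)
printf '%s\n' 'a.txt 111' 'b.txt 222' 'c.txt 111' 'd.txt 333' > "$tmp"

# The same file is passed twice. While NR == FNR awk is still reading the
# first copy, so it only counts hashes. On the second copy it prints every
# record whose hash was recorded in b, i.e. seen more than once.
dupes=$(awk 'NR == FNR { if ($2 in a) b[$2]++; a[$2]++; next } $2 in b' "$tmp" "$tmp")
rm -f "$tmp"

printf '%s\n' "$dupes"
```

Only a.txt and c.txt are printed, since 111 is the only hash that occurs more than once. Input order is preserved because the second pass reads the file sequentially.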

Alternatively:

sort -k2,2 file | uniq -f 1 -D

This sorts the file by the second field and pipes the result to uniq, which prints all duplicate records (-f 1 skips the first field when comparing). Given the size of your input file, this could turn out to be quite resource-intensive.
