Delete All But the Last Comment Line for Each Comment Block

awk, sed, text processing

  • Goal: Delete all but the last comment line for each comment block. If the file ends with a comment block, delete it completely. Each comment line begins with a #.

  • Command that I tried

    sed -z -e 's/#.*\n#/#/g' "${InputP}"
    
  • Input file

    # Life/Living
    # Life/Passion
    - [Mindfulness.md](file:///home/nikhil/Documents/Git/Life/Passion/PassionSrc/Sports/Yoga/Mindfulness/Mindfulness.md)
    # Life/PersonalManagement
    # Life/Social
    # Linux/AmazingNotes
    # Linux/Backintime
    # Linux/DotFiles
    # Linux/GitScripts
    - [Peaceful.m3u](file:///home/nikhil/Documents/Git/../Mobile/Documents/PortableNotes/PortableNotesSrc/SocialActivity/Music/SongsPlaylist/Data/Peaceful.m3u)
    - [AuxiliaryFiles.sh](file:///home/nikhil/Documents/Git/Linux/GitScripts/GitScriptsSrc/GitInit/GitNew/Src/AuxiliaryFiles.sh)
    # PythonWs/NumericalProgramming
    # PythonWs/Python
    # PythonWs/ScientificComputing
    
  • Expected Output

    # Life/Passion
    - [Mindfulness.md](file:///home/nikhil/Documents/Git/Life/Passion/PassionSrc/Sports/Yoga/Mindfulness/Mindfulness.md)
    # Linux/GitScripts
    - [Peaceful.m3u](file:///home/nikhil/Documents/Git/../Mobile/Documents/PortableNotes/PortableNotesSrc/SocialActivity/Music/SongsPlaylist/Data/Peaceful.m3u)
    - [AuxiliaryFiles.sh](file:///home/nikhil/Documents/Git/Linux/GitScripts/GitScriptsSrc/GitInit/GitNew/Src/AuxiliaryFiles.sh)
    
  • But I get this Output

    # PythonWs/ScientificComputing
    

Does anyone know how to solve the problem?

Best Answer

Using GNU sed in slurp mode (-z) with extended regexes (-E), it can be done like this:

$ sed -Ez '
    s/(^|\n)(#[^\n]*\n)+$/\1/
    s/(^|\n)(#[^\n]*\n)+/\1\2/g
' file
  • The first s command removes a trailing comment block, if the file ends with one.
  • The second s command shrinks every remaining comment block to its last line.
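
To try this out, here is a small sketch on a made-up sample (the input lines are hypothetical, not the asker's file); it assumes GNU sed. The second command reruns the attempt from the question for contrast: since . also matches a newline inside the pattern space and .* is greedy, that single match swallows everything from the first # up to the last \n#.

```shell
# Hypothetical sample: three comment blocks, the last one trailing.
sample_lines() {
    printf '%s\n' '# A' '# B' '- x' '# C' '# D' '- y' '- z' '# E' '# F'
}

# The two-step script from the answer (GNU sed: -z and -E).
fixed=$(sample_lines | sed -Ez '
    s/(^|\n)(#[^\n]*\n)+$/\1/
    s/(^|\n)(#[^\n]*\n)+/\1\2/g
')
printf '%s\n' "$fixed"

# The attempt from the question: one greedy match spans almost the whole file.
attempt=$(sample_lines | sed -z -e 's/#.*\n#/#/g')
printf '%s\n' "$attempt"
```

On this sample the first command prints # B, - x, # D, - y, - z (one per line), and the second prints only # F, which mirrors the output reported in the question.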

The GNU sed model is as follows:

  • Sed reads its input one record at a time. By default the record separator is a newline (\n), so each record is a line. With -z the separator is the ASCII NUL character (\0), so a file that contains no NUL bytes is read in as a single record.
  • After reading in a record, the trailing record separator is clipped and the resulting string is stored in the pattern space. The pattern space is where all the sed commands operate.
  • Now let's say there are 5 sed commands in our sed script. The first one is applied to the pattern space and may modify it; the next command is applied to that modified pattern space, and so on sequentially until the last. Then the pattern space is printed to stdout, unless -n is in effect. After this the next record is read in and the same sequence of sed commands is applied to it.

Please note that the above is a very simplified narrative, valid when no flow control commands are used in the sed script.
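
A rough way to see this model in action, assuming GNU sed and some throwaway input strings made up for the demo:

```shell
# Commands run in order on the same pattern space:
# the second s sees the result of the first.
chained=$(printf 'a\n' | sed -e 's/a/b/' -e 's/b/c/')
printf '%s\n' "$chained"

# Default mode: the newline is clipped before the script runs,
# so there is no \n in the pattern space for this s to match.
per_line=$(printf 'x\ny\n' | sed 's/\n/,/g')
printf '%s\n' "$per_line"

# GNU -z: the NUL-free input is a single record, embedded newlines included,
# so the same s command now replaces both of them.
slurped=$(printf 'x\ny\n' | sed -z 's/\n/,/g')
printf '%s\n' "$slurped"
```

The first command prints c, the second leaves x and y on their own lines, and the third prints x,y, on one line.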

Yes, you are right: in slurp mode $ marks the end of the pattern space, which is also the end of the file, since the whole file is read in as a single record.
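
A quick sketch of the difference, assuming GNU sed and a made-up two-line input:

```shell
# Default mode: every line is its own record, so $ matches at each line end.
dollar_per_line=$(printf 'a\nb\n' | sed 's/$/!/')
printf '%s\n' "$dollar_per_line"

# GNU -z: the whole input is one record, so $ matches once, at the very end,
# after the final newline (which is now part of the pattern space).
dollar_slurped=$(printf 'a\nb\n' | sed -z 's/$/!/')
printf '%s\n' "$dollar_slurped"
```

The first command marks both lines (a! and b!), while the second leaves a and b untouched and puts a single ! on a line of its own after them.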

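When you have a construct like (regex)+, the parentheses end up holding the text matched by the last repetition: each repetition overwrites the previous capture, and because + is greedy it keeps repeating through the whole comment block.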
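
This can be checked on a toy input (the strings are made up for the demo; -E and -z assume GNU sed):

```shell
# ([a-z])+ matches all of "abc", but \1 keeps only the last repetition.
last_char=$(printf 'abc\n' | sed -E 's/([a-z])+/[\1]/')
printf '%s\n' "$last_char"

# Same idea with whole comment lines: the group ends up holding "# B\n".
last_line=$(printf '# A\n# B\n' | sed -Ez 's/(#[^\n]*\n)+/\1/')
printf '%s\n' "$last_line"
```

The first command prints [c] and the second prints # B.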

Alternatively, it can be done line by line, using the hold space:

$ sed -e '
    /^#/{h;d;}
    H;z;x;s/^\n//
' file
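
Here is the same script with comments, run on a made-up sample (the input lines are hypothetical). Note that z is a GNU extension, so this sketch still assumes GNU sed.

```shell
hold_result=$(printf '%s\n' '# A' '# B' '- x' '# C' '# D' '- y' '- z' '# E' '# F' | sed -e '
    # Comment line: remember it in the hold space (overwriting any earlier
    # comment) and start the next cycle without printing anything.
    /^#/{h;d;}
    # Other line: append it to the held comment, empty the pattern space,
    # then swap, so the output is "comment\nline" and the hold space is empty.
    # If no comment was pending, strip the leading newline that H added.
    H;z;x;s/^\n//
')
printf '%s\n' "$hold_result"
```

On this sample it prints # B, - x, # D, - y, - z (one per line). A trailing comment block is never printed because nothing follows it to flush the hold space.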