Find a specific string and delete the whole structure

text-processing xml

I have a vertical file where each word (token) is on a separate line with 4 columns. There are also meta-structures such as <doc>, <s>, …
A document looks as follows:

<doc name="sth" url="http">
<p>
<s>
Here   here   k1gInSc1   here
is   be   k1gMnPc2   be
a  a   k2eAgMnPc1d1   a
sentence   sentence   k1gMnPc1   sentence
<g/>
.       .       kIx.
</s>
</p>
</doc>

The problem is that the encoding is sometimes wrong, producing characters such as Ă or Ä in the first column, e.g.:

<doc name="sth" url="http">
<p>
<s>
Here   here   k1gInSc1   here
is   be   k1gMnPc2   be
Ă  Ă   k?   Ă
sentence   sentence   k1gMnPc1   sentence
<g/>
.       .       kIx.
</s>
</p>
</doc>

I need to find these characters and delete the whole document structure. So, if I find Ă on a line, I need to delete everything from the opening <doc...> through the closing </doc>.

My file has a billion lines, and roughly a few thousand of them contain wrongly encoded characters.

I used grep to find the bad characters:

xzcat file.vert.xz | grep -i "Ă\|Ĺ\|ľ\|ş\|Ä" > file_bad_characters.txt

How can I detect these characters and delete not only the matching line but the whole enclosing <doc> structure?

Best Answer

The right way to do this is to use a proper XML parser. However, in this case, the following might work as a workaround:

  1. Remove all blank lines from the file:

    sed -i '/^\s*$/d' file
    
  2. Add a blank line before each <doc>:

    sed -i 's/<doc/\n\n<doc/' file 
    
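  3. Use Perl's "paragraph mode", where "lines" are defined as "paragraphs" (sections of text separated by one or more blank lines):

    perl -00 -ne 'print unless /Ă|Ĺ|ľ|ş|Ä/' file > newfile

    The pattern lists the characters as alternatives because, without UTF-8 decoding, Perl matches bytes, and a bracket expression such as [ĂĹ] would match any single byte of those characters, including bytes that occur in correctly encoded letters.

    Or, to make the replacements in the original file (keeping a backup as file.bak):

    perl -i.bak -00 -ne 'print unless /Ă|Ĺ|ľ|ş|Ä/' file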
    

IMPORTANT: This assumes a well-structured file where everything is inside <doc... tags.
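
For a file of this size, a single streaming pass may be preferable to rewriting the file in place several times. The following is a minimal sketch, not something tried on the real corpus: it assumes the file is UTF-8, that every document starts on a line beginning with <doc and ends on a line beginning with </doc>, and that documents are not nested. The function name drop_bad_docs and the output file name in the usage line are made up for illustration.

```shell
# Hypothetical helper: reads a vertical file on stdin and writes to stdout
# only those <doc>...</doc> blocks that contain none of the bad characters.
drop_bad_docs() {
  # LC_ALL=C makes awk work on raw bytes, so each alternative matches the
  # exact UTF-8 byte sequence of one character and nothing else.
  LC_ALL=C awk '
    /^<doc[ >]/ { in_doc = 1; bad = 0; buf = "" }   # a new document starts
    in_doc {
      buf = buf $0 "\n"                             # hold the document back
      if ($0 ~ /Ă|Ĺ|ľ|ş|Ä/) bad = 1                 # remember any bad line
      if ($0 ~ /^<\/doc>/) {                        # document is complete
        if (!bad) printf "%s", buf                  # emit it only if clean
        in_doc = 0; buf = ""
      }
      next
    }
    { print }                                       # lines outside any <doc>
    END { if (in_doc && !bad) printf "%s", buf }    # unterminated last document
  '
}
```

It could be used directly on the compressed file, without unpacking it first:

    xzcat file.vert.xz | drop_bad_docs | xz > file_clean.vert.xz

Only one document is held in memory at a time, and lines outside any <doc> block are passed through unchanged.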
