I have a vertical file where each word (token) is on a separate line with 4 columns. There are also meta-structures such as <doc>, <s>, …
A document looks as follows:
<doc name="sth" url="http">
<p>
<s>
Here here k1gInSc1 here
is be k1gMnPc2 be
a a k2eAgMnPc1d1 a
sentence sentence k1gMnPc1 sentence
<g/>
. . kIx.
</s>
</p>
</doc>
The problem is that the encoding is sometimes wrong, with characters such as Ă or Ä in the first column, e.g.:
<doc name="sth" url="http">
<p>
<s>
Here here k1gInSc1 here
is be k1gMnPc2 be
Ă Ă k? Ă
sentence sentence k1gMnPc1 sentence
<g/>
. . kIx.
</s>
</p>
</doc>
I need to find these characters and delete the whole enclosing document. So, if I find Ă on a line, I need to delete everything from the opening <doc...> through the closing </doc>, including all lines in between.
My file has a billion lines, and a few thousand of them contain wrongly encoded characters.
I used grep to find the bad characters:
xzcat file.vert.xz | grep -i "Ă\|Ĺ\|ľ\|ş\|Ä" > file_bad_characters.txt
How can I detect these characters and delete not only the matching line but the whole enclosing <doc> structure?
Best Answer
The right way to do this is to use a proper XML parser. However, in this case, the following might work as a workaround:
Remove all blank lines from the file:
Add a blank line before each <doc>:
Use Perl's "paragraph mode", where "lines" are defined as "paragraphs" (sections of text separated by one or more blank lines):
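The same paragraph idea can also be sketched in Python. This is an alternative to a Perl one-liner, not a transcription of it. Instead of inserting blank lines, it starts a new block at every line that begins with <doc, and keeps a block only if none of its lines contains a bad character. The names iter_docs, clean_docs and BAD_CHARS are made up for illustration. The character list is copied from the grep pattern in the question and is matched case-sensitively, unlike grep -i.

```python
# Assumption: the same characters the question's grep pattern looks for.
BAD_CHARS = "ĂĹľşÄ"


def iter_docs(lines):
    """Group lines into blocks, starting a new block at every <doc ...> line.

    This plays the role of the blank line inserted before each <doc>:
    every yielded list is one "paragraph".
    """
    block = []
    for line in lines:
        if line.startswith(("<doc ", "<doc>")) and block:
            yield block
            block = []
        block.append(line)
    if block:
        yield block


def clean_docs(lines, bad_chars=BAD_CHARS):
    """Yield the lines of every block that contains none of bad_chars."""
    bad = set(bad_chars)
    for block in iter_docs(lines):
        # A block is dropped as soon as any of its lines has a bad character.
        if not any(bad.intersection(line) for line in block):
            yield from block
```

To use it as a filter, one could end the script with sys.stdout.writelines(clean_docs(sys.stdin)) (after import sys) and run it as xzcat file.vert.xz | python3 clean_docs.py, where the script name is hypothetical and a UTF-8 locale is assumed. Only one document is held in memory at a time, so the size of the whole file should not matter.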
Or, to make the replacements in the original file:
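For the compressed file from the question, a rewrite in place could look roughly like the Python sketch below. It assumes the file is UTF-8 text inside an .xz container and that there is enough disk space for a temporary copy. The function name rewrite_without_bad_docs and the .tmp suffix are hypothetical, and the character list is again taken from the question's grep pattern.

```python
import lzma
import os

# Assumption: the same characters the question's grep pattern looks for.
BAD_CHARS = set("ĂĹľşÄ")


def rewrite_without_bad_docs(path, bad_chars=BAD_CHARS):
    """Rewrite the .xz file at path, dropping every <doc> block with a bad character."""
    tmp_path = path + ".tmp"  # hypothetical temporary name next to the original
    with lzma.open(path, "rt", encoding="utf-8", newline="\n") as src, \
         lzma.open(tmp_path, "wt", encoding="utf-8", newline="\n") as dst:
        block, block_is_bad = [], False
        for line in src:
            if line.startswith(("<doc ", "<doc>")):
                # A new document starts: flush the previous one if it was clean.
                if not block_is_bad:
                    dst.writelines(block)
                block, block_is_bad = [], False
            if not block_is_bad:
                if bad_chars.intersection(line):
                    # Stop buffering; the rest of this document is skipped.
                    block, block_is_bad = [], True
                else:
                    block.append(line)
        if not block_is_bad:
            dst.writelines(block)
    # Replace the original only after the filtered copy is fully written.
    os.replace(tmp_path, path)
```

Writing to a temporary file first means the original stays untouched if the run is interrupted. Keeping a backup of the original is still a good idea.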
IMPORTANT: This assumes a well-structured file where everything is inside <doc...> tags.