I have output from a program that gives some arbitrary text, with JSON inside, like:
blablablabla
blablab some more text
blablablabla
blablab some more text
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
blablablabla
blablab some more text
blablablabla
blablab some more text
I want to strip the text outside the JSON so I can parse it with "jq".
I need only this text:
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
Thanks!
Best Answer
Would extract the portions of the file comprised between lines that start with { and the next line after that that starts with }.

Would extract the pairs of top-level {...}s wherever they are, being smart enough to cope with input like {"x":{"y":1}} (nested {}) or { "x}" } (} inside strings), or { "x\"}" } (escaped quotes in strings).

If you don't have and can't install pcregrep (comes with the PCRE library), but you have GNU grep, built with PCRE, you can replace with grep -zo though that loads the whole file in memory. Or use perl -l -0777 -ne 'print for m{regexp-above}g'.