Extract JSON from a text file that also contains arbitrary text

jq json text-processing

I have output from a program that contains arbitrary text with some JSON embedded in it, like this:

blablablabla
blablab some more text

blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}


blablablabla
blablab some more text


blablablabla
blablab some more text

I want to strip the text outside the JSON so that I can parse the result with jq.

I need only this text:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

Thanks!

Best Answer

sed '/^{/,/^}/!d' < input

That would extract the portions of the file that lie between a line starting with { and the next line after it that starts with }, both lines included.
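As a quick check of the sed range above, here is a minimal sketch run against a shortened version of the sample from the question. The function name extract_json and the sample variable are made up for illustration; only the sed expression comes from the answer. It assumes the JSON is pretty-printed with its outer braces at the start of a line, as in the question.

```shell
# Hypothetical helper: keep only lines inside a /^{/ ... /^}/ range.
extract_json() {
  # '!d' deletes every line that is NOT within the address range.
  sed '/^{/,/^}/!d'
}

# Shortened stand-in for the program output shown in the question.
sample='blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary"
    }
}
blablablabla
blablab some more text'

# The indented "    }" does not end the range; only a "}" in column one does.
printf '%s\n' "$sample" | extract_json
```

The result can then be piped straight into jq, for example with `| jq .glossary.title`. Note that an object written on a single line, such as {"a":1}, would open the range without closing it, because sed only starts looking for the closing line on the line after the opening one.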

pcregrep -Mo '(?s)(\{(?:[^{}"]++|"(?:\\.|[^"])*+"|(?1))*\})' < file

That would extract every top-level {...} pair wherever it appears in the input, and is smart enough to cope with input like {"x":{"y":1}} (nested {}), { "x}" } (a } inside a string), or { "x\"}" } (escaped quotes inside a string).

If you don't have and can't install pcregrep (it comes with the PCRE library), but you have GNU grep built with PCRE support, you can replace pcregrep -Mo with grep -zPo, though that loads the whole file into memory. Alternatively, use perl -l -0777 -ne 'print for m{regexp-above}g', with the regexp from the pcregrep command substituted for regexp-above.
