Extract .json from a text file with arbitrary text

jq json text-processing

I have output from a program that prints some arbitrary text with JSON inside it, like this:

blablablabla
blablab some more text

blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}


blablablabla
blablab some more text


blablablabla
blablab some more text

I want to strip the text outside the JSON so that I can parse it with jq.

I need only this text:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

Thanks!

Best Answer

sed '/^{/,/^}/!d' < input

This extracts the portions of the file between each line that starts with { and the next line that starts with }, both lines included.
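If a script is easier to maintain than a one-liner, the same line-range idea can be sketched in Python. This is only an illustration and not part of the sed approach itself: the function name extract_brace_ranges is made up here, the sample is a shortened version of the input in the question, and, like the sed command, it assumes the object opens on a line starting with { and closes on a line starting with }.

```python
import json


def extract_brace_ranges(lines):
    """Keep each run of lines from one starting with '{' to the next starting with '}'."""
    kept = []
    inside = False
    for line in lines:
        if not inside and line.startswith("{"):
            inside = True  # a range opens here
        if inside:
            kept.append(line)
            if line.startswith("}"):
                inside = False  # the range closes on this line
    return kept


# Shortened, hypothetical version of the program output from the question.
sample = """blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary"
    }
}
blablablabla
"""

kept = extract_brace_ranges(sample.splitlines())
doc = json.loads("\n".join(kept))
print(doc["glossary"]["title"])
```

As with sed, this breaks if the surrounding text contains a line that happens to start with { or }.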

pcregrep -Mo '(?s)(\{(?:[^{}"]++|"(?:\\.|[^"])*+"|(?1))*\})' < file

This extracts every top-level {...} pair wherever it appears, and is smart enough to cope with input like {"x":{"y":1}} (nested {}), { "x}" } (} inside strings), or { "x\"}" } (escaped quotes in strings).
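A rough equivalent without a recursive regex is to let a real JSON parser decide where each object ends. The sketch below uses json.JSONDecoder.raw_decode from the Python standard library; the function name extract_json_objects and the sample string are invented for this example. Note one difference from the regex: it only returns text that parses as valid JSON, so a balanced-brace fragment such as { "x}" } (a key with no value) is skipped rather than extracted.

```python
import json


def extract_json_objects(text):
    """Yield every top-level JSON object embedded in text, skipping everything else."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return  # no more candidate objects
        try:
            # raw_decode parses one value starting at `start` and
            # returns it together with the index just past its end.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here; try the next brace
            continue
        yield obj
        pos = end


# Hypothetical input: nested braces, a } inside a string, an escaped quote.
sample = 'blabla\n{"x": {"y": 1}}\nmore {"k": "x}"} text {"k": "x\\"}"} end'
objects = list(extract_json_objects(sample))
for obj in objects:
    print(json.dumps(obj))
```

Because the parser tracks strings and nesting itself, there is nothing to tune for escaped quotes or nested objects, at the cost of reading the whole text into memory.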

If you don't have and can't install pcregrep (it comes with the PCRE library) but you have GNU grep built with PCRE support, you can replace pcregrep -Mo with grep -zPo, though that loads the whole file into memory. Or use perl -l -0777 -ne 'print for m{regexp-above}g'.

Related Solutions

How to Extract Data from a JSON File

You can use jq to process JSON files in the shell.

For example, I saved your sample JSON file as raul.json and then ran:

$ jq .message.temperature raul.json 
409.5
25.1
409.5
$ jq .message.humidity raul.json 
null
40
null

jq is available pre-packaged for most Linux distros.

There's probably a way to do it in jq itself, but the simplest way I found to get both the wanted values on one line is to use xargs. For example:

$ jq 'select(.message.id == 1490) | .message.temperature, .message.humidity' raul.json | xargs
25.1 40

Or, if you want to loop over every .message.id instance, add .message.id to the output and use xargs -n 3, since each record then has exactly three fields (id, temperature, humidity):

$ jq '.message.id, .message.temperature, .message.humidity' raul.json | xargs -n 3
4095 409.5 null
1490 25.1 40
2039 409.5 null

You could then post-process that output with awk or whatever.

Finally, both Python and Perl have excellent libraries for parsing and manipulating JSON data, as do several other languages, including PHP and Java.
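As a rough illustration of the Python route, here is a minimal sketch using only the standard json module. The SAMPLE text is a hypothetical stand-in for raul.json, reconstructed from the jq output above (a stream of concatenated objects, each with a message member); the real file may well have more fields. The helper name iter_json_stream is also made up for this example.

```python
import json

# Hypothetical stand-in for raul.json, reconstructed from the jq output above.
SAMPLE = """
{"message": {"id": 4095, "temperature": 409.5}}
{"message": {"id": 1490, "temperature": 25.1, "humidity": 40}}
{"message": {"id": 2039, "temperature": 409.5}}
"""


def iter_json_stream(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# One (id, temperature, humidity) tuple per record; missing keys become None.
rows = [
    (msg.get("id"), msg.get("temperature"), msg.get("humidity"))
    for msg in (doc["message"] for doc in iter_json_stream(SAMPLE))
]
for row in rows:
    print(*row)
```

Where jq prints null for a missing humidity, this prints None; from here the rows can be filtered or formatted without a separate xargs or awk step.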

How to extract a value from a JSON file containing an encoded JSON object

The JSON document that you get from your command seems to contain another, encoded JSON document. It is from this encoded document that you appear to want the data.

To get at the internal document, we may use jq:

aws ... |
jq -r '.Policy'

To get the value of the Effect key from the statement that contains the aws:SecureTransport key, we need to parse the decoded document again:

aws ... |
jq -r '.Policy' |
jq -r '.Statement[] | select(.Condition.Bool."aws:SecureTransport").Effect'

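The last jq call goes through all the elements of the Statement array and selects those whose .Condition.Bool."aws:SecureTransport" value is neither null nor false, which in practice means the elements that have that key (the string "false" in your document still counts as true for select). It then gets the value of the Effect key associated with each selected Statement element.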

Running this on your data outputs the value Deny.

If you want the value of that .Condition.Bool."aws:SecureTransport" key (the string "false" in your document), use .Condition.Bool."aws:SecureTransport" in place of .Effect above.

Alternatively, use jq's fromjson function instead of a second jq invocation:

aws ... |
jq -r '.Policy | fromjson | .Statement[] | select(.Condition.Bool."aws:SecureTransport").Effect'

Here, fromjson decodes the encoded JSON document and passes it to the later stages of processing.
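The same decode-twice idea can be sketched outside jq, for instance in Python. This is an illustration only: outer_text is a hypothetical, trimmed stand-in for the output of the aws command (its Policy value is a JSON document stored as a string), and the function name secure_transport_effects is invented here.

```python
import json

# Hypothetical, trimmed stand-in for the encoded policy document.
inner = {
    "Statement": [
        {
            "Sid": "ForceSSLOnlyAccess",
            "Effect": "Deny",
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {"Sid": "AWSCloudTrailAclCheck20150319", "Effect": "Allow"},
    ]
}
# The outer document carries the inner one as an encoded string.
outer_text = json.dumps({"Policy": json.dumps(inner)})


def secure_transport_effects(outer_json):
    """Return the Effect of every statement that sets aws:SecureTransport."""
    # First loads() decodes the outer document, second decodes .Policy.
    policy = json.loads(json.loads(outer_json)["Policy"])
    effects = []
    for stmt in policy["Statement"]:
        value = stmt.get("Condition", {}).get("Bool", {}).get("aws:SecureTransport")
        # Mirror jq's select(): keep anything that is not null or false.
        if value is not None and value is not False:
            effects.append(stmt["Effect"])
    return effects


print(secure_transport_effects(outer_text))
```

The second json.loads call plays the role of fromjson: the Policy value is just a string until it is parsed again.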

Just for reference, the internal encoded JSON document looks like this (aws ... | jq -r '.Policy | fromjson'):

{
  "Version": "2012-10-17",
  "Id": "S3SecureTransportPolicy",
  "Statement": [
    {
      "Sid": "ForceSSLOnlyAccess",
      "Effect": "Deny",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::amn/*",
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    },
    {
      "Sid": "AWSCloudTrailAclCheck20150319",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::amn"
    },
    {
      "Sid": "AWSCloudTrailWrite20150319",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::amn/AWSLogs/405042254276/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-acl": "bucket-owner-full-control"
        }
      }
    }
  ]
}
