Ubuntu – Grepping patterns in a JSON file

command-line grep json text-processing

How can I select the lines from my text files that are similar to this one?

"created_at": "Wed Oct 19 12:36:54 +0000 2016"

Basically, I need to find lines containing a pattern that

starts with Wed Oct 19 and
ends with 2016

However, Wed Oct 19 12:36:54 +0000 2016 could be anywhere in the line, and any other time of day could appear in between.

When I use

grep -irn "Wed Oct 19" | grep -irn "2016"

I get all sorts of unwanted results.

Here's an example of a similar line from the file I don't want to match:

"created_at": "Tue Jan 31 18:50:26 +0000 2012",

This is part of a tweet's attributes.

Here's a longer part of the input:

 "contributors": null, 
      "retweeted": false, 
      "in_reply_to_user_id_str": null, 
      "place": null, 
      "retweet_count": 4, 
      "created_at": "Sun Apr 03 23:48:36 +0000 2011", 
      "retweeted_status": {
            "text": "In preparation for the NFL lockout, I will be spending twice as much time analyzing my fantasy baseball team during company time. #PGP", 
            "truncated": false, 
            "in_reply_to_user_id": null, 
            "in_reply_to_status_id": null,

Complete example input here:
https://gist.github.com/hrp/900964

UPDATE: I am looking for the file names that contain this pattern in them.

Best Answer

If it could be anywhere in the line, and anything could be in between, I guess

grep -wirn 'Wed Oct 19 .* 2016' *

should get it...

If you only want the filenames, use -l

grep -wirl 'Wed Oct 19 .* 2016' *

Notes

-w use word boundaries in case the text you want is stuck onto something else we don't want to match (unlikely in this case)
-l just print the filenames of files that contain the match
.* any number of any characters here
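
As a minimal sketch of the -l behavior described in these notes, the following creates two sample files in a temporary directory and runs the same command against them. The file names match.json and nomatch.json are invented here purely for illustration, and GNU grep is assumed.

```shell
# Scratch directory with two hypothetical sample files (names are illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' '      "created_at": "Wed Oct 19 12:36:54 +0000 2016",' > "$tmpdir/match.json"
printf '%s\n' '      "created_at": "Tue Jan 31 18:50:26 +0000 2012",' > "$tmpdir/nomatch.json"

# Same command as in the answer: -l lists only the names of files with a match.
matched_files=$(cd "$tmpdir" && grep -wirl 'Wed Oct 19 .* 2016' *)
echo "$matched_files"

rm -rf "$tmpdir"
```

Only the file holding the Wed Oct 19 ... 2016 line should be listed; the 2012 line contains neither the date prefix nor the year, so its file is skipped.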

It's probably OK to parse this file with grep, especially for something so simple, but using a JSON parser as mentioned in David Foerster's answer is the Right Way (i.e. it will likely be more reliable, especially if you need to do anything complex).

Related Solutions

How to Grep for Multiple Patterns on Multiple Lines

Updated 18-Nov-2016 (grep's behavior has changed: with the -P option, grep no longer supports the ^ and $ anchors [on Ubuntu 16.04 with kernel v4.4.0-21-generic]) (wrong (non-)fix)

$ grep -Pzo "begin(.|\n)*\nend" file
begin
Some text goes here.  
end

Note: for the other commands, just replace the '^' and '$' anchors with the newline character '\n'.

With grep command:

grep -Pzo "^begin\$(.|\n)*^end$" file

If you don't want to include the patterns "begin" and "end" in the result, use grep's lookbehind and lookahead support.

grep -Pzo "(?<=^begin$\n)(.|\n)*(?=\n^end$)" file

You can also use the \K escape instead of a lookbehind assertion.

grep -Pzo "^begin$\n\K(.|\n)*(?=\n^end$)" file

\K discards everything matched before it from the reported match, so the preceding pattern itself is not printed.
\n is used to avoid printing empty lines in the output.

Or, as @AvinashRaj suggests, there are simpler grep commands, as follows:

grep -Pzo "(?s)^begin$.*?^end$" file

grep -Pzo "^begin\$[\s\S]*?^end$" file

(?s) tells grep to allow the dot to match newline characters.
[\s\S] matches any character that is either whitespace or non-whitespace.

And the variants that exclude "begin" and "end" from the output are as follows:

grep -Pzo "^begin$\n\K[\s\S]*?(?=\n^end$)" file # or grep -Pzo "(?<=^begin$\n)[\s\S]*?(?=\n^end$)"

grep -Pzo "(?s)(?<=^begin$\n).*?(?=\n^end$)" file

See the full test of all commands here (out of date, since grep's behavior with the -P option has changed).
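
A minimal sketch of the newline-based variants, assuming GNU grep built with PCRE support (-P) and a hypothetical sample file created only for illustration. It uses a literal \n in place of ^ and $, as the update at the top of this answer suggests, and pipes the output through tr to drop the NUL byte that -z appends to each match.

```shell
# Hypothetical sample file with a begin/end block surrounded by other lines.
sample=$(mktemp)
printf '%s\n' 'intro' 'begin' 'Some text goes here.' 'end' 'outro' > "$sample"

# Whole block, delimiters included. (?s) lets the dot match newlines;
# .*? is non-greedy, so the match stops at the first "\nend".
with_delims=$(grep -Pzo '(?s)begin\n.*?\nend' "$sample" | tr -d '\0')

# Body only: \K drops "begin\n" from the reported match, and the
# lookahead (?=\nend) requires the closing line without consuming it.
body_only=$(grep -Pzo '(?s)begin\n\K.*?(?=\nend)' "$sample" | tr -d '\0')

printf '%s\n' "$with_delims"
printf '%s\n' "$body_only"

rm -f "$sample"
```

Because "begin" is not anchored to the start of a line in this sketch, it would also match at the end of a longer word; that is acceptable for the sample data but may need tightening for real input.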

Note:

^ marks the beginning of a line and $ marks the end of a line. They are added around "begin" and "end" so that these match only when they are alone on a line.
In two commands I escaped $ because it is also used for command substitution ($(command)), which lets the output of a command replace the command itself.

From man grep:

-o, --only-matching
      Print only the matched (non-empty) parts of a matching line,
      with each such part on a separate output line.

-P, --perl-regexp
      Interpret PATTERN as a Perl compatible regular expression (PCRE)

-z, --null-data
      Treat the input as a set of lines, each terminated by a zero byte (the ASCII 
      NUL character) instead of a newline. Like the -Z or --null option, this option 
      can be used with commands like sort -z to process arbitrary file names.

Ubuntu – Searching for specialized patterns using grep in a json file

Adding -z to your grep options makes grep treat the input as lines terminated by NUL characters (\0) rather than by newlines, so a file without NUL bytes is handled as one long line. The newlines themselves do not seem to be matchable in the regex, however. The workaround is simply to match everything (.*) up to the end of your desired pattern (in your case "created_at").

Next, you can add -o to have grep output only what is actually matched; otherwise it outputs the whole file (since it is now essentially one giant line). Alternatively, if the only purpose of writing the output to a file is to run wc -l on it later, I would instead suggest grep's -c option, which prints a count of matching lines for each file rather than the matches themselves. Keep in mind that with -z a file without NUL bytes counts as a single line.

This translates to the following command:

grep -wirnEzc '},.*created_at' *
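
To make the effect of -z concrete, here is a small sketch that assumes GNU grep and uses a hypothetical sample file whose contents are invented for illustration. It runs the core of the pattern once in normal line mode and once with -z.

```shell
# Hypothetical multi-line sample: two objects, each followed by a created_at field.
sample=$(mktemp)
printf '%s\n' '{' '  "id": 1' '},' '"created_at": "a"' \
              '{' '  "id": 2' '},' '"created_at": "b"' > "$sample"

# Line mode: "}," and "created_at" sit on different lines, so nothing matches.
# (|| true keeps a zero-match exit status from aborting a script run with set -e.)
count_lines=$(grep -Ec '},.*created_at' "$sample" || true)

# -z mode: the NUL-free file is one record, so .* can span the newlines.
# -c counts matching records, which is at most 1 for such a file.
count_record=$(grep -Ezc '},.*created_at' "$sample" || true)

echo "line mode: $count_lines, -z mode: $count_record"
rm -f "$sample"
```

In this sketch the -z count says whether the file matches at all, not how many times the pattern occurs inside it.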

Expanding on this to include your previous pattern as well we get:

grep -wirnEzc '},.*created_at":\s"Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) .* 2016' *
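
The time-of-day alternation in this pattern accepts 21:00:00 through 22:30:00 inclusive. As a hedged sketch, the following tries only that part of the regex, without -z and without the leading },.*created_at portion, on a few hypothetical one-line samples made up for illustration.

```shell
# Time-of-day part of the pattern above, tried on hypothetical one-line samples.
pattern='Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) .* 2016'

sample=$(mktemp)
printf '%s\n' \
  '"created_at": "Wed Oct 19 21:15:00 +0000 2016",' \
  '"created_at": "Wed Oct 19 22:30:00 +0000 2016",' \
  '"created_at": "Wed Oct 19 22:30:01 +0000 2016",' \
  '"created_at": "Wed Oct 19 12:36:54 +0000 2016",' > "$sample"

# Print the lines whose time falls inside the window, then count them.
in_window=$(grep -E "$pattern" "$sample")
echo "$in_window"
window_count=$(grep -Ec "$pattern" "$sample")

rm -f "$sample"
```

The 21:15:00 and 22:30:00 lines are selected. 22:30:01 is rejected because only the literal 30:00 is allowed once the minutes reach 30, and 12:36:54 is rejected because the hour must start with 2.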