Text Processing – Convert Output of Tree Command to JSON Format

jsontext processingtree

Is there a convenient way to convert the output of the *nix command "tree" to JSON format?

Edit:
I think I didn't describe my problem well enough. My goal is to convert something like:

.
|-- dir1
|   |-- dirA
|   |   |-- dirAA
|   |   `-- dirBB
|   `-- dirB
`-- dir2
    |-- dirA
    `-- dirB

into:

{"dir1" : [{"dirA":["dirAA", "dirAB"]}, "dirB"], "dir2": ["dirA", "dirB"]}

Best Answer

Attempt 1

A solution using just perl, returning a simple hash of hashes structure. Before the OP clarified data format of JSON.

#! /usr/bin/perl

use File::Find;
use JSON;

use strict;
use warnings;

my $dirs={};
my $encoder = JSON->new->ascii->pretty;

find({wanted => \&process_dir, no_chdir => 1 }, ".");
print $encoder->encode($dirs);

sub process_dir {
    return if !-d $File::Find::name;
    my $ref=\%$dirs;
    for(split(/\//, $File::Find::name)) {
        $ref->{$_} = {} if(!exists $ref->{$_});
        $ref = $ref->{$_};
    }
}

File::Find module works in a similar way to the unix find command. The JSON module takes perl variables and converts them into JSON.

find({wanted => \&process_dir, no_chdir => 1 }, ".");

Will iterate down the file structure from the present working directory calling the subroutine process_dir for each file/directory under ".", and the no_chdir tell perl not to issue a chdir() for each directory it finds.

process_dir returns if the present examined file is not a directory:

return if !-d $File::Find::name;

We then grab a reference of the existing hash %$dirs into $ref, split the file path around / and loop with for adding a new hash key for each path.

Making a directory structure like slm did:

mkdir -p dir{1..5}/dir{A,B}/subdir{1..3}

The output is:

{
   "." : {
      "dir3" : {
         "dirA" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         },
         "dirB" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         }
      },
      "dir2" : {
         "dirA" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         },
         "dirB" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         }
      },
      "dir5" : {
         "dirA" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         },
         "dirB" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         }
      },
      "dir1" : {
         "dirA" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         },
         "dirB" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         }
      },
      "dir4" : {
         "dirA" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         },
         "dirB" : {
            "subdir2" : {},
            "subdir3" : {},
            "subdir1" : {}
         }
      }
   }
}

Attempt 2

Okay now with different data structure...

#! /usr/bin/perl

use warnings;
use strict;
use JSON;

my $encoder = JSON->new->ascii->pretty;   # ascii character set, pretty format
my $dirs;                                 # used to build the data structure

my $path=$ARGV[0] || '.';                 # use the command line arg or working dir

# Open the directory, read in the file list, grep out directories and skip '.' and '..'
# and assign to @dirs
opendir(my $dh, $path) or die "can't opendir $path: $!";
my @dirs = grep { ! /^[.]{1,2}/ && -d "$path/$_" } readdir($dh);
closedir($dh);

# recurse the top level sub directories with the parse_dir subroutine, returning
# a hash reference.
%$dirs = map { $_ => parse_dir("$path/$_") } @dirs;

# print out the JSON encoding of this data structure
print $encoder->encode($dirs);

sub parse_dir {
    my $path = shift;    # the dir we're working on

    # get all sub directories (similar to above opendir/readdir calls)
    opendir(my $dh, $path) or die "can't opendir $path: $!";
    my @dirs = grep { ! /^[.]{1,2}/ && -d "$path/$_" } readdir($dh);
    closedir($dh);

    return undef if !scalar @dirs; # nothing to do here, directory empty

    my $vals = [];                            # set our result to an empty array
    foreach my $dir (@dirs) {                 # loop the sub directories         
        my $res = parse_dir("$path/$dir");    # recurse down each path and get results

        # does the returned value have a result, and is that result an array of at 
        # least one element, then add these results to our $vals anonymous array 
        # wrapped in a anonymous hash
        # ELSE
        # push just the name of that directory our $vals anonymous array
        push(@$vals, (defined $res and scalar @$res) ? { $dir => $res } : $dir);
    }

    return $vals;  # return the recursed result
}

And then running the script on the proposed directory structure...

./tree2json2.pl .
{
   "dir2" : [
      "dirB",
      "dirA"
   ],
   "dir1" : [
      "dirB",
      {
         "dirA" : [
            "dirBB",
            "dirAA"
         ]
      }
   ]
}

I found this pretty damn tricky to get right (especially given the "hash if sub directories, array if not, OH UNLESS top level, then just hashes anyway" logic). So I'd be surprised if this was something you could do with sed / awk ... but then Stephane hasn't looked at this yet I bet :)

Related Solutions

Create a JSON file of all installed dpkg software

A few months ago I wrote a simple ruby script to do a very similar job for our monitoring tool. I only needed the name and the version. I added the short_description and author. Any other fields may require more processing. It's a starting point for something you can build out if you wish.

#!/usr/bin/env ruby

require 'open3'
# json is only necessary for the pretty_generate at end, remove if not needed
require 'json'

allpkgs = {}
# Edit this command to serve your own purposes
cmd = ("dpkg-query -W -f='${binary:Package};${Version};${binary:Summary};${Maintainer}\n'")

dpkgout, stderr, status = Open3.capture3(cmd)
dpkgout.split("\n").each do |line|
  pkginfo = line.split(';')
  allpkgs[pkginfo[0]] = { 'version': pkginfo[1], 'short_description': pkginfo[2], 'author': pkginfo[3] }
end

# pretty JSON print, otherwise use 'puts allpkgs'
puts JSON.pretty_generate(allpkgs)

Shell Script – How to Parse Stdout as a Mix of CSV and JSON

Separating the JSON from the rest is quite easy. This will give you the non JSON only:

python submit.py --provider gt --assignment error-check | sed '/{/,$d'

And this, only the JSON:

python submit.py --provider gt --assignment error-check | sed -n '/{/,$p'

To illustrate, I have saved your example input as file and:

$ sed '/{/,$d' file
Problem,Correct?,Correct Answer,Agent's Answer
"Challenge Problem B-04",0,4,-1
"Basic Problem B-12",0,1,-1
"Challenge Problem B-05",0,6,-1
"Challenge Problem B-07",0,6,-1
"Challenge Problem B-06",0,3,-1
"Basic Problem B-11",0,1,-1
"Basic Problem B-10",0,3,-1
"Challenge Problem B-03",0,3,-1
"Challenge Problem B-02",0,1,-1
"Challenge Problem B-01",0,6,-1
"Challenge Problem B-09",0,4,-1
"Challenge Problem B-08",0,4,-1
"Basic Problem B-08",0,6,-1
"Basic Problem B-09",0,5,-1
"Basic Problem B-04",0,3,-1
"Basic Problem B-05",0,4,-1
"Basic Problem B-06",0,5,-1
"Basic Problem B-07",0,6,-1
"Basic Problem B-01",0,2,-1
"Basic Problem B-02",0,5,-1
"Basic Problem B-03",0,1,-1
"Challenge Problem B-10",0,4,-1
"Challenge Problem B-11",0,5,-1
"Challenge Problem B-12",0,1,-1

And

$ sed -n '/{/,$p' file
{
    "Basic Problems B": {
        "Incorrect": "0",
        "Skipped": "12",
        "Correct": "0",
        "Set": "Basic Problems B"
    },
    "Challenge Problems B": {
        "Incorrect": "0",
        "Skipped": "12",
        "Correct": "0",
        "Set": "Challenge Problems B"
    }
}

Now, you already deal with the non-JSON perfectly well, so I won't change that. Ideally, the JSON data should be parsed using a JSON parser, like jq. Sadly, I don't know enough jq to do this properly, so the best I could come up with is this, rather inelegant, solution. At least it does do what you want (replace cat file with your python submit.py --provider gt --assignment error-check command:

$ cat file | sed -n 's/[,"]//g; s/^ *//; /{/,$p'  | tac | awk -F': ' 'BEGIN{printf "%-30s%-10s%-10s%-10s\n", "Set", "Incorrect", "Skipped", "Correct"} NF==2 && !/\{/{if($1=="Set"){set=$2;data[set]["Incorrect"] = 0;data[set]["Skipped"] = 0;data[set]["Correct"] = 0;} data[set][$1]=$2}END{for(set in data){printf "%-30s%-10s%-10s%-10s\n", set,data[set]["Incorrect"],data[set]["Skipped"],data[set]["Correct"]}}' 
Set                           Incorrect Skipped   Correct   
Challenge Problems B          0         12        0         
Basic Problems B              0         12        0

Putting all this together in a shell script gives:

#!/bin/bash

tmpFile=$(mktemp)
python submit.py --provider gt --assignment error-check > "$tmpFile";

sed '/{/,$d' "$tmpFile" | column -t -s, 
sed -n 's/[,"]//g; s/^ *//; /{/,$p' "$tmpFile" |
  tac |
  awk -F': ' '
    BEGIN{
      printf "%-30s%-10s%-10s%-10s\n", "Set", "Incorrect", "Skipped", "Correct"
    }
    NF==2 && !/\{/{
      if($1=="Set"){
         set=$2;
         data[set]["Incorrect"] = 0;
         data[set]["Skipped"] = 0;
         data[set]["Correct"] = 0;
      } 
      data[set][$1]=$2
    }
    END{
       for(set in data){
         printf "%-30s%-10s%-10s%-10s\n", set, 
                                     data[set]["Incorrect"], 
                                     data[set]["Skipped"], 
                                     data[set]["Correct"]}
    }' 
rm "$tmpFile"

Which produces the following output:

$ foo.sh
Problem                   Correct?  Correct Answer  Agent's Answer
"Challenge Problem B-04"  0         4               -1
"Basic Problem B-12"      0         1               -1
"Challenge Problem B-05"  0         6               -1
"Challenge Problem B-07"  0         6               -1
"Challenge Problem B-06"  0         3               -1
"Basic Problem B-11"      0         1               -1
"Basic Problem B-10"      0         3               -1
"Challenge Problem B-03"  0         3               -1
"Challenge Problem B-02"  0         1               -1
"Challenge Problem B-01"  0         6               -1
"Challenge Problem B-09"  0         4               -1
"Challenge Problem B-08"  0         4               -1
"Basic Problem B-08"      0         6               -1
"Basic Problem B-09"      0         5               -1
"Basic Problem B-04"      0         3               -1
"Basic Problem B-05"      0         4               -1
"Basic Problem B-06"      0         5               -1
"Basic Problem B-07"      0         6               -1
"Basic Problem B-01"      0         2               -1
"Basic Problem B-02"      0         5               -1
"Basic Problem B-03"      0         1               -1
"Challenge Problem B-10"  0         4               -1
"Challenge Problem B-11"  0         5               -1
"Challenge Problem B-12"  0         1               -1
Set                           Incorrect Skipped   Correct   
Challenge Problems B          0         12        0         
Basic Problems B              0         12        0

It feels hacky though, and I hope someone can come up with a cleaner solution with dedicated JSON parsers.

Steeldriver was nice enough to give a proper jq solution in a comment, so if we incorporate that, we get the far simpler (and safer):

#!/bin/bash

tmpFile=$(mktemp)
python submit.py --provider gt --assignment error-check > "$tmpFile";

sed '/{/,$d' "$tmpFile" | column -t -s, 
sed -n '/{/,$p' "$tmpFile" | 
  jq -r '["Set","Incorrect","Skipped","Correct"], (.[] | [.Set,.Incorrect,.Skipped,.Correct]) | @tsv'
 rm "$tmpFile"