jq – Nested Reduce and Grouping of JSON Data

bashjqjson

I'm working on a large JSON dataset (1GB+) and need to merge similar array objects, then merge their nested similar objects.
Starting with, I have rows such as:

[{"item": "item1", "attributes":[{"type": "itemtype", "colour": ["blue"]}]},
{"item": "item1", "attributes":[{"type": "itemtype", "colour": ["grey"]}]},
{"item": "item2", "attributes":[{"type": "itemtype", "colour": ["blue"]}]},
{"item": "item2", "attributes":[{"type": "itemtype2", "colour": ["orange"]}]},
{"item": "item2", "attributes":[{"type": "itemtype2", "colour": ["blue"]}]},
{"item": "item3", "attributes":[{"type": "itemtype", "colour": ["blue"]}]}]

I have used jq to group by and pretty print these with the code:

 jq 'group_by(.item) | map({"item": .[0].item, "attributes": map(.attributes[])})

To group by the item, and sort of combine the attributes:

[
  {
    "item": "item1",
    "attributes": [
      {
        "type": "itemtype",
        "colour": [
          "blue"
        ]
      },
      {
        "type": "itemtype",
        "colour": [
          "grey"
        ]
      }
    ]
  },
  {
    "item": "item2",
    "attributes": [
      {
        "type": "itemtype",
        "colour": [
          "blue"
        ]
      },
      {
        "type": "itemtype2",
        "colour": [
          "orange"
        ]
      },
      {
        "type": "itemtype2",
        "colour": [
          "blue"
        ]
      }
    ]
  },
  {
    "item": "item3",
    "attributes": [
      {
        "type": "itemtype",
        "colour": [
          "blue"
        ]
      }
    ]
  }
]

My challenge is grouping these nested attributes together, grouping by type and adding the colours to one array based on the type. So for example I'd have something like:

[
  {
    "item": "item1",
    "attributes": [
      {
        "type": "itemtype",
        "colour": [
          "blue",
          "grey"
        ]
      }
    ]
  },
  {
    "item": "item2",
    "attributes": [
      {
        "type": "itemtype",
        "colour": [
          "blue"
        ]
      },
      {
        "type": "itemtype2",
        "colour": [
          "orange",
          "blue"
        ]
      }
    ]
  },
  {
    "item": "item3",
    "attributes": [
      {
        "type": "itemtype",
        "colour": [
          "blue"
        ]
      }
    ]
  }
]

I've tried online editors that use Lodash or JMESPath to try and understand it better, as well as tried to add another map() within map(.attributes[]) but am not getting anywhere. I think I need to add reduce() somewhere, but I can't get my head around it.

Thanks!

Best Answer

Thank you for this question. It was enjoyable to work with.


Here's a variation without any reduce:

group_by(.item) |
map({
    item: first.item,
    attributes: (
        group_by(.attributes[].type) |
        map({
            type: first.attributes[].type,
            colour: ( map(.attributes[].colour[]) )
        })
    )
})

I start by grouping the original elements by their item key. This gives us an array that is grouped by item. Each element in that array consists of a sub-array containing all the original objects with the same item.

The first map() brings these groups together by creating one object for each group. The object has item and attributes keys, and the value for the item key is taken arbitrarily from the first element of the group (they are all the same).

A further group_by() and map() creates the value for the attributes key. This time grouping over the type key down in the attributes array of the original objects and, for each created group, collecting the type and colour values from the original objects.


You can also do it with reduce, like so:

group_by(.item) |
map(reduce .[] as $a ({}; .item = $a.item | .attributes += $a.attributes)) |
map(.attributes |= (
    group_by(.type) |
    map(reduce .[] as $a ({}; .type = $a.type | .colour += $a.colour))
))

This is more or less group_by()+map(reduce) to create the outer structure of the result, i.e., the grouping of the data into parts based on item and organising that outer structure. The values for the attributes array are just passed along.

This pattern (group_by()+map(reduce)) is then repeated for each group's attributes array for grouping and organising it based on the type values.


Performance-wise, the two solutions above are on par with each other for small inputs, with the second variant (using reduce) being a bit faster for inputs less than 150 times the size in the question (about 64 KB). For these size inputs, a run takes about 70 ms on my system, i.e. the run time is negligible.

When the input grows further, the reduced-based solution seems to require exponentially more time. For an input of size 4 MB, the first code uses around 800 ms, while the second clocks in at 25 seconds for a single run.

If jq becomes difficult to run for larger inputs (a linear extrapolation suggest it should take me about 54 hours to run a 1 GB data set), you may want to look into processing the data in some way (potentially in a database or at least work with the data in some other format which is not JSON).

For example, the given data could be transformed into CSV:

item,type,colour
item1,itemtype,blue
item1,itemtype,grey
item2,itemtype,blue
item2,itemtype2,orange
item2,itemtype2,blue
item3,itemtype,blue

... or into an equivalent database table, and then worked on from there.

For example, with Miller,

$ mlr --csv nest --ivar ';' -f colour file.csv
item,type,colour
item1,itemtype,blue;grey
item2,itemtype,blue
item2,itemtype2,orange;blue
item3,itemtype,blue
Related Question