I'm working on a large JSON dataset (1GB+) and need to merge similar array objects, then merge their nested similar objects.
To start with, I have rows such as:
[{"item": "item1", "attributes":[{"type": "itemtype", "colour": ["blue"]}]},
{"item": "item1", "attributes":[{"type": "itemtype", "colour": ["grey"]}]},
{"item": "item2", "attributes":[{"type": "itemtype", "colour": ["blue"]}]},
{"item": "item2", "attributes":[{"type": "itemtype2", "colour": ["orange"]}]},
{"item": "item2", "attributes":[{"type": "itemtype2", "colour": ["blue"]}]},
{"item": "item3", "attributes":[{"type": "itemtype", "colour": ["blue"]}]}]
I have used jq to group these by item and pretty-print them with:
jq 'group_by(.item) | map({"item": .[0].item, "attributes": map(.attributes[])})'
This groups by the item and, sort of, combines the attributes:
[
{
"item": "item1",
"attributes": [
{
"type": "itemtype",
"colour": [
"blue"
]
},
{
"type": "itemtype",
"colour": [
"grey"
]
}
]
},
{
"item": "item2",
"attributes": [
{
"type": "itemtype",
"colour": [
"blue"
]
},
{
"type": "itemtype2",
"colour": [
"orange"
]
},
{
"type": "itemtype2",
"colour": [
"blue"
]
}
]
},
{
"item": "item3",
"attributes": [
{
"type": "itemtype",
"colour": [
"blue"
]
}
]
}
]
My challenge is grouping these nested attributes together, grouping by type and adding the colours to one array based on the type. So for example I'd have something like:
[
{
"item": "item1",
"attributes": [
{
"type": "itemtype",
"colour": [
"blue",
"grey"
]
}
]
},
{
"item": "item2",
"attributes": [
{
"type": "itemtype",
"colour": [
"blue"
]
},
{
"type": "itemtype2",
"colour": [
"orange",
"blue"
]
}
]
},
{
"item": "item3",
"attributes": [
{
"type": "itemtype",
"colour": [
"blue"
]
}
]
}
]
I've tried online editors that use Lodash or JMESPath to understand the problem better, and I've tried adding another map() inside map(.attributes[]), but I'm not getting anywhere. I think I need reduce somewhere, but I can't get my head around it.
Thanks!
Best Answer
Thank you for this question. It was enjoyable to work with.
Here's a variation without any reduce.

I start by grouping the original elements by their item key. This gives us an array that is grouped by item. Each element in that array is a sub-array containing all the original objects with the same item.

The first map() brings these groups together by creating one object for each group. The object has item and attributes keys, and the value for the item key is taken arbitrarily from the first element of the group (they are all the same).

A further group_by() and map() creates the value for the attributes key. This time the grouping is over the type key down in the attributes array of the original objects, and, for each created group, the type and colour values are collected from the original objects.
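Following those steps, such a filter might look like this (a sketch, using the sample data from the question; the exact formulation may differ):

```shell
# Sample input: one row per item/attribute pair, as in the question.
input='[{"item":"item1","attributes":[{"type":"itemtype","colour":["blue"]}]},
{"item":"item1","attributes":[{"type":"itemtype","colour":["grey"]}]},
{"item":"item2","attributes":[{"type":"itemtype","colour":["blue"]}]},
{"item":"item2","attributes":[{"type":"itemtype2","colour":["orange"]}]},
{"item":"item2","attributes":[{"type":"itemtype2","colour":["blue"]}]},
{"item":"item3","attributes":[{"type":"itemtype","colour":["blue"]}]}]'

# Outer group_by/map pairs up rows with the same "item"; the inner
# group_by/map then merges each item's attribute objects by "type",
# concatenating their colour arrays.
printf '%s' "$input" |
jq 'group_by(.item)
    | map({item: .[0].item,
           attributes: (map(.attributes[])
                        | group_by(.type)
                        | map({type: .[0].type,
                               colour: map(.colour[])}))})'
```

Note that group_by() sorts the groups by the grouping key, which is why item1, item2, item3 come out in order.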
You can also do it with reduce, like so:
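A sketch of such a reduce-based filter, again over the sample data (the exact shape of the fold is an assumption on my part):

```shell
input='[{"item":"item1","attributes":[{"type":"itemtype","colour":["blue"]}]},
{"item":"item1","attributes":[{"type":"itemtype","colour":["grey"]}]},
{"item":"item2","attributes":[{"type":"itemtype","colour":["blue"]}]},
{"item":"item2","attributes":[{"type":"itemtype2","colour":["orange"]}]},
{"item":"item2","attributes":[{"type":"itemtype2","colour":["blue"]}]},
{"item":"item3","attributes":[{"type":"itemtype","colour":["blue"]}]}]'

# Each group of rows is folded into one object, appending every row's
# attributes; the same fold is then applied inside each attributes
# array, this time on "type", appending the colour arrays.
printf '%s' "$input" |
jq 'group_by(.item)
    | map(reduce .[] as $row ({item: null, attributes: []};
              .item = $row.item
            | .attributes += $row.attributes)
          | .attributes |= (group_by(.type)
              | map(reduce .[] as $attr ({type: null, colour: []};
                        .type = $attr.type
                      | .colour += $attr.colour))))'
```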
This is more or less group_by() + map(reduce) to create the outer structure of the result, i.e. the grouping of the data into parts based on item, and to organise that outer structure. The values for the attributes array are just passed along.

This pattern (group_by() + map(reduce)) is then repeated for each group's attributes array, grouping and organising it based on the type values.

Performance-wise, the two solutions above are on par with each other for small inputs, with the second variant (using reduce) being a bit faster for inputs less than 150 times the size of the one in the question (about 64 KB). For inputs of that size, a run takes about 70 ms on my system, i.e. the run time is negligible.

When the input grows further, the reduce-based solution seems to require exponentially more time. For an input of size 4 MB, the first code uses around 800 ms, while the second clocks in at 25 seconds for a single run.

If jq becomes difficult to run for larger inputs (a linear extrapolation suggests it would take me about 54 hours to run a 1 GB data set), you may want to look into processing the data in some other way, potentially in a database, or at least to work with the data in some format other than JSON.

For example, the given data could be transformed into CSV:
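One such flattening, one CSV row per item/type/colour combination, might be produced with jq's @csv (a sketch over the sample data; the column layout is my choice):

```shell
input='[{"item":"item1","attributes":[{"type":"itemtype","colour":["blue"]}]},
{"item":"item1","attributes":[{"type":"itemtype","colour":["grey"]}]},
{"item":"item2","attributes":[{"type":"itemtype","colour":["blue"]}]},
{"item":"item2","attributes":[{"type":"itemtype2","colour":["orange"]}]},
{"item":"item2","attributes":[{"type":"itemtype2","colour":["blue"]}]},
{"item":"item3","attributes":[{"type":"itemtype","colour":["blue"]}]}]'

# Emit a header row, then one row per (item, type, colour) triple.
printf '%s' "$input" |
jq -r '(["item","type","colour"],
        (.[] | .item as $i
             | .attributes[]
             | .type as $t
             | .colour[] as $c
             | [$i, $t, $c]))
       | @csv'
```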
... or into an equivalent database table, and then worked on from there.
For example, with Miller (mlr), or similar tools that work record by record, the flattened rows could then be grouped and aggregated without parsing the whole JSON document at once.
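As a rough illustration of that kind of streaming pass, here is a sketch using plain sort and awk (standing in for Miller) over flattened item,type,colour rows:

```shell
# Flattened CSV rows (no header): item,type,colour.
# Sort by item and type, then merge consecutive rows sharing both,
# joining their colours with ";" -- a single pass, holding only one
# group's worth of state in memory at a time.
printf '%s\n' \
  'item1,itemtype,blue' \
  'item1,itemtype,grey' \
  'item2,itemtype,blue' \
  'item2,itemtype2,orange' \
  'item2,itemtype2,blue' \
  'item3,itemtype,blue' |
sort -t, -k1,1 -k2,2 |
awk -F, '
  { key = $1 FS $2 }
  key != prev { if (prev != "") print prev FS colours
                prev = key; colours = $3; next }
  { colours = colours ";" $3 }
  END { if (prev != "") print prev FS colours }'
```

Note that sort(1) breaks ties on the whole line, so colours within a group come out alphabetically here rather than in input order.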