MongoDB MapReduce – Troubleshooting Output Errors with Large Input Queries

mongodb, mongodb-3.2

Please can you help identify why this mapReduce is failing (some of the time), and propose how to ensure it works all the time?

I'm using MongoDB's mapReduce to group array values by date and then compute the average value at each date. This generally works well, but some points fail, with the mapReduce returning "nan" values, and I cannot work out why. Stranger still, although the points that fail are consistent when I re-run the function, they do not always fail when I run the mapReduce on fewer documents, even though the documents that go into the failing group do not change. Hopefully the figures below will clarify what I mean.

Given a load of points on a map (dots), the map function groups them together into boxes based on their location. Each point has an array of dates and values. The reduce function adds together these values at each date, and the finalize function calculates the average value for each date, giving an average value for each square on the map.
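As a worked example of the grouping step, the sketch below applies the same tile formulas used in the map function further down to two of the sample points listed later in this question. It is plain JavaScript run outside MongoDB, and `tileKey` is a name introduced here for illustration only.

```javascript
// Same slippy-map tile formulas as the map function in this question.
function lon2tile(lon, zoom) {
    return Math.floor((lon + 180) / 360 * Math.pow(2, zoom));
}
function lat2tile(lat, zoom) {
    var rad = lat * Math.PI / 180;
    return Math.floor((1 - Math.log(Math.tan(rad) + 1 / Math.cos(rad)) / Math.PI) / 2 * Math.pow(2, zoom));
}

// Hypothetical helper: builds the "zoom_x_y" group key for a [lon, lat] pair.
function tileKey(coordinates, zoom) {
    return zoom + '_' + lon2tile(coordinates[0], zoom) + '_' + lat2tile(coordinates[1], zoom);
}

// Two of the four sample points from the problem square.
var keyA = tileKey([-1.5254854131489992, 53.79415290802717], 18);
var keyB = tileKey([-1.52468541264534, 53.79335290752351], 18);
```

Both keys should match the `_id` of the problem square, "18_129961_84424", which would suggest the grouping itself is not where the points are lost.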

Figure 1. Points grouped into squares, where the red point (and possibly pink, although it is on the boundary) has not been added.

Figure 1 (not the same square as in Figures 2 and 3) shows what I mean by the error: although the point in the red circle certainly should be inside that square (the non-green square), the mapReduce function has failed here and has not added the point.

To demonstrate the inconsistent behavior, below I show the map reduce function performed on a small area of a city.

Figure 2. mapReduce function working fine on a small area, red box drawn to indicate the problem square.

Figure 2 shows the mapReduce function performed on a small subset of my data, which was achieved by modifying the query: {$geoWithin…} clause to select only documents within a small polygon that fits in the figure. The next figure shows the same mapReduce function, but with the query: {$geoWithin} clause selecting 1/8th of the UK, shown as the red square in Figure 4.

Figure 3. Same mapReduce function as used in Figure 2. but now with an error in the summation.

As can be seen in Figure 3, most squares have processed fine and produced the same result. However, one square shown here (and several others elsewhere) failed, and querying the output of the mapReduce shows "nan" values for them.

With the working version, the document in the red box looks like:

{                                                               
    "_id" : "18_129961_84424",                              
    "geometry" : {                                          
        "type" : "Polygon",                             
        "coordinates" : [[                                       
                [-1.525726318359375, 53.79335064315454],
                [-1.525726318359375, 53.794161837371036],
                [-1.52435302734375, 53.794161837371036],
                [-1.52435302734375, 53.79335064315454],
                [-1.525726318359375, 53.79335064315454]
            ]]
    },
    "properties" : [
        {
            "date" : ISODate("2015-08-15T00:00:00Z"),
            "sum" : -9.486295223236084,
            "points" : 4,
            "displace" : -2.371573805809021
        }
    ]
}

Whereas on the broken version, the same document looks like:

{                                                               
    "_id" : "18_129961_84424",                              
    "geometry" : {                                          
        "type" : "Polygon",                             
        "coordinates" : [[                                       
                [-1.525726318359375, 53.79335064315454],
                [-1.525726318359375, 53.794161837371036],
                [-1.52435302734375, 53.794161837371036],
                [-1.52435302734375, 53.79335064315454],
                [-1.525726318359375, 53.79335064315454]
            ]]
    },
    "properties" : [
        {
            "date" : ISODate("2015-08-15T00:00:00Z"),
            "sum" : NaN,
            "points" : 3,
            "displace" : NaN
        }
    ]
}

Because this square is sometimes processed correctly, I don't doubt the data; rather, something is going wrong in the mapReduce function. The four points that should have been summed together are:

{ "_id" : ObjectId("57a888d4c7afa6e97e7fe00c"), "geometry" : { "type" : "Point", "coordinates" : [ -1.5254854131489992, 53.79415290802717 ] }, "properties" : [ { "date" : ISODate("2015-08-15T00:00:00Z"), "displace" : -2.3721842765808105 } ] }
{ "_id" : ObjectId("57a888d4c7afa6e97e7fe37a"), "geometry" : { "type" : "Point", "coordinates" : [ -1.5254854131489992, 53.79335290752351 ] }, "properties" : [ { "date" : ISODate("2015-08-15T00:00:00Z"), "displace" : -2.382347822189331 } ] }
{ "_id" : ObjectId("57a888d4c7afa6e97e7fe37b"), "geometry" : { "type" : "Point", "coordinates" : [ -1.52468541264534, 53.79335290752351 ] }, "properties" : [ { "date" : ISODate("2015-08-15T00:00:00Z"), "displace" : -2.372774124145508 } ] }
{ "_id" : ObjectId("57a888d4c7afa6e97e7fe00d"), "geometry" : { "type" : "Point", "coordinates" : [ -1.52468541264534, 53.79415290802717 ] }, "properties" : [ { "date" : ISODate("2015-08-15T00:00:00Z"), "displace" : -2.3589890003204346 } ] }

I suspect it may be an issue in my Reduce function not being idempotent, as proposed in the answer to MongoDB MapReduce returning unexpected results and grouping twice, but I'm not sure if that is true, and if so, I'm unsure how to ensure it is idempotent in this case. For completeness I include my actual mapReduce function below.

var map = function(){
    function lon2tile (lon, zoom){ return Math.floor((lon+180)/360*Math.pow(2,zoom)); }
    function lat2tile (lat, zoom){ return Math.floor((1-Math.log(Math.tan(lat*Math.PI/180) + 1/Math.cos(lat*Math.PI/180))/Math.PI)/2 *Math.pow(2,zoom)); }
    function tile2long(x,z) { return (x/Math.pow(2,z)*360-180); }
    function tile2lat(y,z) {
        var n=Math.PI-2*Math.PI*y/Math.pow(2,z);
        return (180/Math.PI*Math.atan(0.5*(Math.exp(n)-Math.exp(-n))));
    }
    function tile2poly(x, y, z){
        xl = tile2long(x, z);
        yt = tile2lat(y,z);
        xr = tile2long(x+1,z);
        yb = tile2lat(y+1,z);
        poly = [[
            [xl, yb],
            [xl, yt],
            [xr, yt],
            [xr, yb],
            [xl, yb]
        ]];
        return poly
    }

    var zoom = 18;
    var lon = this.geometry.coordinates[0];
    var lat = this.geometry.coordinates[1];
    var xtile = lon2tile(lon, zoom);
    var ytile = lat2tile(lat, zoom);

    var key = zoom+'_'+xtile+'_'+ytile;
    var poly = tile2poly(xtile, ytile, zoom);

    var value = {
        geometry: {type: 'Polygon', coordinates: poly},
        properties: this.properties
    };

    for(var idx=0; idx< value.properties.length; idx++){
        value.properties[idx].points = 1;
    };

    emit (key, value);
}


var reduce = function(mapKey, mapVal){
    redVal = {
        "geometry" : mapVal[0].geometry,
        "properties": []
    };

    for(var idx=0; idx< mapVal.length; idx++){
        for(var pidx=0; pidx< mapVal[idx].properties.length; pidx++){
            loc = -1;
            for (var el=0; el<redVal.properties.length; el++){
                if(redVal.properties[el].date.toISOString() == mapVal[idx].properties[pidx].date.toISOString()){
                    loc = el;
                    break;
                }
            }

            if (loc == -1){
                redVal.properties.push({'date': mapVal[idx].properties[pidx].date,
                                        'sum': mapVal[idx].properties[pidx].displace,
                                        'points': 1});
            }
            else{
                redVal.properties[loc].sum += mapVal[idx].properties[pidx].displace;
                redVal.properties[loc].points += mapVal[idx].properties[pidx].points;
            }
        }
    };

    return redVal;
}

var final = function(redKey, redVal){
    for (var el=0; el<redVal.properties.length; el++){
        if (!("sum" in redVal.properties[el])){
            redVal.properties[el].sum = redVal.properties[el].displace;
        }

        redVal.properties[el].displace = redVal.properties[el].sum / redVal.properties[el].points;
    } 

    return redVal;
}

var query_in = {
    'geometry': {
        '$geoIntersects': {
            '$geometry': {
                'type': 'Polygon',
                'coordinates': [[
                    [-2.8125, 53.33087298301705],
                    [-2.8125, 54.1624339680678],
                    [-1.40625, 54.1624339680678],
                    [-1.40625, 53.33087298301705],
                    [-2.8125, 53.33087298301705]
                ]]                            
            }
        }
    }
}

db.c0.mapReduce(map, reduce, {out: "mrTest", query:query_in, finalize:final})

Upon some further investigation I see that these errors only appear near the end of the map reduce process (see image below). The data selected in the mapReduce query stage are all the points inside the red box. The gap to the right edge is expected, as there is no data there.

Figure 4. UK coverage, showing missing point further south.


Having investigated the single problem square in Figure 3 further, I can see that the map part is correctly grouping 4 points. I checked this by building an array during the reduce phase, pushing each mapped value into it each time reduce is called. This points to the reduce phase as the part that fails, although I cannot understand why.

By modifying the reduce function to:

function(mapKey, mapVal){
    redVal = {
        "all_mapped": [],
        "geometry" : mapVal[0].geometry,
        "properties": []
    };

    for(var idx=0; idx< mapVal.length; idx++){
        redVal.all_mapped.push({'iter':idx, 'map': mapVal[idx]});
        for(var pidx=0; pidx< mapVal[idx].properties.length; pidx++){
            var loc = -1;
            for (var el=0; el<redVal.properties.length; el++){
                if(redVal.properties[el].date.toISOString() === mapVal[idx].properties[pidx].date.toISOString()){
                    loc = el;
                    break;
                }
            }

            if (loc === -1){
                redVal.properties.push({'date': mapVal[idx].properties[pidx].date,
                                        'sum': mapVal[idx].properties[pidx].displace,
                                        'points': mapVal[idx].properties[pidx].points});
            }
            else{
                redVal.properties[loc].sum += mapVal[idx].properties[pidx].displace;
                redVal.properties[loc].points += mapVal[idx].properties[pidx].points;
            }
        }
    };

    return redVal;
};

it is possible to see in the output that the reduce function is called twice. The first time it has 3 mapped values, which it sums together correctly. Then the reduce function is called a second time to combine the previously reduced output with the 1 additional point. I believe this is where the summation fails.
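A small sketch of what such a second call would do to the sum, written in plain JavaScript outside MongoDB. It assumes the second call receives one value straight from map (which carries `displace`) and one value returned by the earlier reduce call (which carries `sum` but no `displace`). The names `fromMap`, `fromReduce` and `running` are introduced here for illustration, and the partial sum is an illustrative number rather than real output.

```javascript
// Shape of a value produced by map: carries `displace`, no `sum`.
var fromMap = { displace: -2.3589890003204346, points: 1 };

// Shape of a value returned by an earlier reduce call: carries `sum`, no `displace`.
var fromReduce = { sum: -7.1, points: 3 };   // illustrative numbers only

// The reduce function above always reads `.displace`, whichever kind of value it is given.
var running = fromMap.displace;      // first value seen for this date
running += fromReduce.displace;      // `.displace` is undefined on a reduced value
var isBroken = isNaN(running);       // number + undefined evaluates to NaN
```

Under that assumption the sum becomes NaN as soon as an already reduced value is mixed in, which is consistent with the "sum" : NaN seen in the broken document.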

Best Answer

A variety of things I would try:

  • Polygons are using floats; is it possible the x/y tiles are wrong for some points due to the rounding errors inherent in that datatype (though I don't see why this wouldn't be idempotent)?
  • Runs of holes seem to follow lats rather than lons, so this points to something code-related rather than db-related (I would expect any db weirdness to affect lat and lon equally).
  • Can you verify the correctness of your map and reduce functions? You could write unit tests for the primary and corner cases.
  • In your reduce, your loc is not declared with var; is it possible this global can be accessed by parallel executors?
  • Date comparison should use ===. Also, are these dates definitely not auto-updated or anything? Could they change over a longer map-reduce session?
  • More debug/print statements to log intermediate states for the failing tiles; https://stackoverflow.com/questions/13963483/how-to-get-print-output-for-debugging-map-reduce-in-mongoid
  • For performance, pass functions in the scope rather than have them defined in every map; https://stackoverflow.com/questions/7273379/how-to-use-variables-in-mongodb-map-reduce-map-function
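On the idempotency suspicion raised in the question: MongoDB's documentation states that reduce may be invoked more than once for the same key, with its previous output passed back in as one of the input values, so the value reduce returns needs the same shape as the value map emits. The sketch below illustrates that idea together with the unit-test suggestion above. It assumes map is changed to emit `sum` and `points` per date rather than `displace`. The names `toEmitted` and `reduceSafe` are hypothetical, and this is plain JavaScript meant to be tried outside the database first, not a verified drop-in fix.

```javascript
// Hypothetical helper: what map would emit per entry, so that emitted values
// and reduced values share one shape ({date, sum, points}).
function toEmitted(properties) {
    return properties.map(function (p) {
        return { date: p.date, sum: p.displace, points: 1 };
    });
}

// Reduce that only reads `sum` and `points`, so it accepts both freshly
// mapped values and its own earlier output.
function reduceSafe(mapKey, mapVal) {
    var redVal = { geometry: mapVal[0].geometry, properties: [] };
    for (var idx = 0; idx < mapVal.length; idx++) {
        var props = mapVal[idx].properties;
        for (var pidx = 0; pidx < props.length; pidx++) {
            var loc = -1;
            for (var el = 0; el < redVal.properties.length; el++) {
                if (redVal.properties[el].date.getTime() === props[pidx].date.getTime()) {
                    loc = el;
                    break;
                }
            }
            if (loc === -1) {
                // Copy the entry so the input values are never mutated.
                redVal.properties.push({ date: props[pidx].date,
                                         sum: props[pidx].sum,
                                         points: props[pidx].points });
            } else {
                redVal.properties[loc].sum += props[pidx].sum;
                redVal.properties[loc].points += props[pidx].points;
            }
        }
    }
    return redVal;
}

// Unit-test style check using the four displacements from the question:
// reducing in one call and reducing in two stages (3 values, then the
// partial result plus 1 value) should agree.
var day = new Date("2015-08-15T00:00:00Z");
var vals = [-2.3721842765808105, -2.382347822189331,
            -2.372774124145508, -2.3589890003204346].map(function (d) {
    return { geometry: null, properties: toEmitted([{ date: day, displace: d }]) };
});
var oneShot = reduceSafe("k", vals);
var partial = reduceSafe("k", vals.slice(0, 3));
var twoStage = reduceSafe("k", [partial, vals[3]]);
```

With values shaped this way, finalize would only need to divide `sum` by `points`, since a key that is never reduced already carries both fields.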