postgresql – Postgres vs MapReduce for Matching Reciprocal Needle Pairs

hadoop, performance, postgresql, scalability

I have a Postgres (latest version) table with about a million rows; each row has coordinate columns (a_lat, a_lon, b_lat, b_lon) and about a dozen criteria columns.

These rows go together in reciprocal pairs (A – B, B – A), which I need to identify. There may be multiple rows with similar A and B lat/lon whose criteria are similar but not identical.

The criteria matching isn't exact, so many passes are made with progressively less strict criteria (exact match in early rounds, then +/- 1, then +/- 1.5, etc.). Matches found in each pass are removed from subsequent passes.

I'm currently doing the matching in Postgres by looping over each row in the table and doing a query to find matches:

SELECT
    id
FROM
    locations
WHERE
    a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND 
    a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND 
    criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND 
    criteria_2 = 23 AND 
    criteria_3 = 'ABCD' AND 
    (criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
    criteria_5 = 'A' AND
    criteria_6 = 1

If there is exactly one match, it's recorded and that row is removed from future passes.

If you do the math, it's a ridiculous number of queries (billions), and the job takes days to run. All fields have been optimized (floats and integers where possible) and indexed.

Is this type of problem suitable for a MapReduce type solution?

Can MapReduce handle multiple criteria and criteria ranges (+/- 1)? Can a performance gain be expected on the same hardware (384 GB RAM, 64-core Xeon), or does the performance increase come only from the ability of MapReduce solutions to scale out to multiple machines?

Thanks!

Best Answer

I think it could be helpful for you to think of MapReduce as (essentially) a distributed query engine. The correspondence isn't one-to-one, but SELECT plus aggregate functions such as sum() and max() behave very much like a Map() operation, and GROUP BY behaves very much like a Reduce(), so there is quite a bit of similarity.
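
As a rough illustration of that analogy, using columns from your own table and treating criteria_3 as an arbitrary grouping key, the per-row work in the query below is the "map" phase and the GROUP BY plus aggregates are the "reduce" phase:

-- "Map": emit (criteria_3, criteria_1) for every row.
-- "Reduce": collapse each criteria_3 key down to aggregates.
SELECT
    criteria_3,
    count(*)        AS n_rows,
    max(criteria_1) AS max_criteria_1
FROM
    locations
GROUP BY
    criteria_3;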

In your case, on the same hardware, I think the only benefit you will be able to gain from using MapReduce is the distribution of work across multiple cores. Remember that each PostgreSQL query runs in a single backend process (built-in parallel query, where available, only helps certain plan shapes), so most of your 64 cores sit idle if you're running these queries one at a time.
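
As an aside, recent PostgreSQL releases (9.6 and later) do offer limited built-in parallel query, but it mainly speeds up large scans and aggregates, not millions of tiny index lookups, so fanning the work out across connections yourself is still where the gain is. If you want to see what the server is allowed to do on its own, the settings below are the relevant GUCs (the value 8 is only an illustrative choice):

-- Built-in parallel query settings (PostgreSQL 9.6+), shown for reference.
-- Parallel plans mainly help big scans and aggregates, not point lookups.
SHOW max_worker_processes;
SHOW max_parallel_workers_per_gather;
-- Per-session override; 8 is just an illustrative value on a 64-core box.
SET max_parallel_workers_per_gather = 8;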

Parallelization

Perhaps you could break the query up so that each invocation runs over only a segment of your locations table, and then run those queries in parallel across several connections (for example through a connection pooler such as PgBouncer)?

For some good info about accessing data in sequential chunks (pagination), check out this amazing blog post by Markus Winand.

One way you might break this into chunks (warning: untested SQL ahead):

SELECT
    id
FROM
    (SELECT * FROM locations LIMIT x OFFSET y) AS locs_fragment
WHERE
    a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
    a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
    criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
    criteria_2 = 23 AND
    criteria_3 = 'ABCD' AND
    (criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
    criteria_5 = 'A' AND
    criteria_6 = 1;

Note that the sub-select SELECT * FROM locations LIMIT x OFFSET y, given no additional criteria, will typically return rows in physical (ctid) order, although without an ORDER BY that ordering is not guaranteed. By selecting appropriate LIMIT and OFFSET values for a range of queries, you can essentially run the query in parallel.
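
One variation worth testing (also untested, and assuming locations has an integer primary key id, which the question doesn't state) is to slice by explicit id ranges instead of LIMIT/OFFSET, since a large OFFSET still forces the server to walk past all of the skipped rows on every call:

-- Untested sketch: each parallel connection claims one contiguous id range
-- (here the first 250,000 ids); the criteria in the WHERE clause are unchanged.
SELECT
    id
FROM
    (SELECT * FROM locations WHERE id >= 0 AND id < 250000) AS locs_fragment
WHERE
    a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
    a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
    criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
    criteria_2 = 23 AND
    criteria_3 = 'ABCD' AND
    (criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
    criteria_5 = 'A' AND
    criteria_6 = 1;

With roughly a million rows, four such slices line up with four concurrent connections; tune the number of slices to however many cores you want to keep busy.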

Denormalization

Don't rule out denormalization as an option either. In your case it can get you the results you want much more quickly. Of course, denormalization only really helps if your query criteria are fairly stable over time, so that you can effectively pre-calculate result sets.

I'd say there are two decent options:

(1) Use partial indexing, where a given index is built around one set of criteria, for example:

CREATE INDEX idx_locations_criteria_set_0
    ON locations (id)
    WHERE
        a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
        a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
        criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
        criteria_2 = 23 AND
        criteria_3 = 'ABCD' AND
        (criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
        criteria_5 = 'A' AND
        criteria_6 = 1;

With such an index defined, a query whose WHERE clause matches the index predicate can be answered with a scan of that (much smaller) partial index, rather than evaluating all of the conditions against the whole table.
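
If you go that route, it's worth confirming that the planner actually uses the partial index. Something like the following (again untested, reusing the example index name from above) should do it, provided the query's predicate matches the index predicate:

-- The plan should show an Index Scan (or Index Only Scan) using
-- idx_locations_criteria_set_0 rather than a Seq Scan on locations.
EXPLAIN
SELECT
    id
FROM
    locations
WHERE
    a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
    a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
    criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
    criteria_2 = 23 AND
    criteria_3 = 'ABCD' AND
    (criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
    criteria_5 = 'A' AND
    criteria_6 = 1;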

(2) Another denormalization option is to use a trigger on INSERT into your locations table that checks whether a given set of conditions is met and, when it is, stores the id of that row in a separate table. Finding matching entries later then only requires a scan of that small table.
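
A minimal sketch of that trigger approach (untested; the side table criteria_set_0_hits and the assumption that locations.id is an integer are mine, and the hard-coded values stand in for whichever criteria set you pre-compute):

-- Side table of rows that satisfy one pre-chosen criteria set,
-- maintained automatically as rows are inserted into locations.
CREATE TABLE IF NOT EXISTS criteria_set_0_hits (
    location_id integer PRIMARY KEY   -- assumes locations.id is an integer
);

CREATE OR REPLACE FUNCTION record_criteria_set_0() RETURNS trigger AS $$
BEGIN
    -- Same predicate as the partial-index example above.
    IF NEW.a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334)
       AND NEW.a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334)
       AND NEW.criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5)
       AND NEW.criteria_2 = 23
       AND NEW.criteria_3 = 'ABCD'
       AND (NEW.criteria_4 = '0014980726' OR NEW.criteria_4 IS NULL)
       AND NEW.criteria_5 = 'A'
       AND NEW.criteria_6 = 1
    THEN
        INSERT INTO criteria_set_0_hits (location_id) VALUES (NEW.id);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_locations_criteria_set_0
    AFTER INSERT ON locations
    FOR EACH ROW EXECUTE PROCEDURE record_criteria_set_0();

Finding the matching entries later is then just a SELECT location_id FROM criteria_set_0_hits.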

Both of these approaches require additional storage space and add some minor INSERT overhead for predicate checking, but they let you spread that predicate checking over time rather than doing it all at once when a query is issued.

Conclusions

As with any advice on the internet, take mine with a grain of salt! :P I've done some work in both of these areas, but that doesn't make me an expert in your application area. Run some small-scale tests of each approach to see what works best for your application.