PostgreSQL – Finding Nearest Geo Points Across Two Tables

postgispostgresqlspatial

I have two tables, both have lat/lon columns within same country (UK). Tables sizes are roughly 80M and 50M.

Along with the lat/lon columns I have created a geo-indexes for both tables in this way:

SELECT AddGeometryColumn('my_table_50/80', 'geom', 4326, 'POINT', 2);
UPDATE my_table SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
CREATE INDEX my_table_geom_idx ON my_table USING gist(geom);

To find the nearest points from the table with 80M respect to 50M within 0.1 miles I run something like:

   SELECT A.latitude, A.longitude, B.latitude, B.longitude,
   FROM my_table_50 AS A, my_table_80 AS B
   where ST_Distance(A.geom, B.geom) < 0.1609 -- 1 mile / 10
   ORDER BY ST_Distance(A.geom, B.geom) ASC LIMIT 1;

The query is very slow to run (pretty much is cartesian 50M X 80M).

Is there a way to speed it up?

Also, is it really useful to use "postgis geo indexing" for a problem like this?
Using "pythagorean theorem" can be enough (like in the chosen answer here https://stackoverflow.com/questions/1664799/calculating-distance-between-two-points-using-pythagorean-theorem), as I am expecting distances way far shortest than earth radius or it may lead to some error?

Best Answer

Don't use ST_Distance. It will never use an index. Instead use KNN distance with <->, and use ST_DWithin when possible.

To find the nearest points from the table with 80M respect to 50M within 0.1 miles I run something like:

To find the nearest point, you can do..

SELECT A.latitude, A.longitude, B.latitude, B.longitude,
FROM my_table_50 AS A, my_table_80 AS B
ORDER BY A.geom <-> B.geom
LIMIT 1

To find the nearest point within one mile,

SELECT A.latitude, A.longitude, B.latitude, B.longitude,
FROM my_table_50 AS A, my_table_80 AS B
WHERE ST_DWithin(A.geom, B.geom, 1609.34)
ORDER BY A.geom <-> B.geom
LIMIT 1

Also, unless your postgis is old don't ever do

SELECT AddGeometryColumn('my_table_50/80', 'geom', 4326, 'POINT', 2);

From the docs,

Changed: 2.0.0 This function no longer updates geometry_columns since geometry_columns is a view that reads from system catalogs. It by default also does not create constraints, but instead uses the built in type modifier behavior of PostgreSQL. So for example building a wgs84 POINT column with this function is now equivalent to: ALTER TABLE some_table ADD COLUMN geom geometry(Point,4326);

Make sure in the above you have a spatial/gist index on A.geom and B.geom. And, consider clustering both of them on those indexes.

Related Solutions

Mysql – Mongodb – Returning the points nearest to a location, with distance

MongoDB has built in support for geoindexing. You don't need to do the calculation yourself.

Basically, you would create a field with the lat/long stored as an array or as sub documents, something like one of these:

{ loc : [ 50 , 30 ] } //SUGGESTED OPTION
{ loc : { x : 50 , y : 30 } }
{ loc : { lon : 40.739037, lat: 73.992964 } }

Then index the new loc field appropriately:

db.places.ensureIndex( { loc : "2d" } )

Finally you can then use one of the operators to query a point for the nearest 20 results:

db.places.find( { loc : { $near : [50,50] } } ).limit(20)

You could, of course, just use MongoDB to store the data, then pull the information out of the DB with a find() and do the calculation client-side but I imagine that is not what you want to do.

If the distance part of the equation is what you want:

http://www.mongodb.org/display/DOCS/Geospatial+Indexing#GeospatialIndexing-geoNearCommand

The $geoNear operator will return the distance also. An example:

> db.runCommand( { geoNear : "places" , near : [50,50], num : 10 } );
{
        "ns" : "test.places",
        "near" : "1100110000001111110000001111110000001111110000001111",
        "results" : [
                {
                        "dis" : 69.29646421910687,
                        "obj" : {
                                "_id" : ObjectId("4b8bd6b93b83c574d8760280"),
                                "y" : [
                                        1,
                                        1
                                ],
                                "category" : "Coffee"
                        }
                },
                {
                        "dis" : 69.29646421910687,
                        "obj" : {
                                "_id" : ObjectId("4b8bd6b03b83c574d876027f"),
                                "y" : [
                                        1,
                                        1
                                ]
                        }
                }
        ],
        "stats" : {
                "time" : 0,
                "btreelocs" : 1,
                "btreelocs" : 1,
                "nscanned" : 2,
                "nscanned" : 2,
                "objectsLoaded" : 2,
                "objectsLoaded" : 2,
                "avgDistance" : 69.29646421910687
        },
        "ok" : 1
}

The "dis" : 69.29646421910687 elements are what you are looking for, there is also a spherical distance option.

For all this, how to use the distances, and more, take a look here for more information on geo indexes and how to use them:

http://www.mongodb.org/display/DOCS/Geospatial+Indexing/

Optimizing Intersect Query Between Huge Spatial Tables in PostgreSQL

Update:

Since this was published, the problem evolve and now we deal with 140kk+ rows. However, Postgis also evolved and it is now possible to "fix" the the feature table. No more need of Dice from ArcMap.

I did it using a ST_VoronoiPolygons approach. I created a working gist with a function that breaks features on the original table in feasible sizes for processing.

Original Answer:

For information, I was able to do this using other tools and splitting the workload:

To everything work fine, I had to limit each feature in 800 vertices max and 15 km² max. I tried some tools and even a recurring procedure on plsql, but without success. The only thing that I tried and split everything correctly was the Dice feature of ArcMap;
I divided my 170k+ rows in chunks of 20 rows and ran six instances of the query in parallel to compute the area using a .net console app;
With the area stored, I was able to do some processing, also in chunks and parallel, to compute the value of each single 9kk+ cell.

This process now takes "only" 3 hours to finish.

The ST_Intersects is light. The problem was this calculation:

st_area(st_intersection(grade.the_geom, uso.geom)).

Build the intersection of complex features and calculate the area was the demanding task.