Postgresql – How to speed-up the query on geo-location processes

gist-indexindexpostgresqlspatial

I have a table that contains 10,301,390 GPS records, cities, countries and IP address blocks. I have user's current location with latitude and longitude. I created this query:

SELECT
  *, point(45.1013021, 46.3021011) <@> point(latitude, longitude) :: point AS distance
FROM
  locs
WHERE
  (
    point(45.1013021, 46.3021011) <@> point(latitude, longitude)
  ) < 10 -- radius
ORDER BY
  distance LIMIT 1;

This query successfully gave me what I want, but it is slow. It took 2 to 3 seconds to get one record by given latitude and longitude.

I tried a B-Tree index on the latitude and longitude columns, also tried GIST( point(latitude, longitude)); but still querying is slow.

How can I speed up this query?

Update:

It seems slowness is caused by the ORDER BY but I want to get the shortest distance, so the question remains.

Best Answer

You may consider using a GIST index based on using the function ll_to_earth. This index will allow for fast "nearby" searches.

CREATE INDEX 
   ON locs USING gist (ll_to_earth(lat, lng));

Once you have this index, your query should be done in a different way.

Your (lat, lng) pairs need to be converted to the earth type, and compared with the indexed values (which are of the same type). Your query will need to have two conditions, one for "approximate" result, and one for the "precise" one. The first one will be able to use the previous index:

SELECT
    *
FROM
    locs
WHERE
    /* First condition allows to search for points at an approximate distance:
       a distance computed using a 'box', instead of a 'circumference'.
       This first condition will use the index.
       (45.1013021, 46.3021011) = (lat, lng) of search center. 
       25000 = search radius (in m)
    */
    earth_box(ll_to_earth(45.1013021, 46.3021011), 25000) @> ll_to_earth(lat, lng) 

    /* This second condition (which is slower) will "refine" 
       the previous search, to include only the points within the
       circumference.
    */
    AND earth_distance(ll_to_earth(45.1013021, 46.3021011), 
             ll_to_earth(lat, lng)) < 25000 ;

For using this code, you need two extensions (included in most PostgreSQL distributions):

CREATE EXTENSION IF NOT EXISTS cube ;
CREATE EXTENSION IF NOT EXISTS earthdistance;

This is the documentation for them:

Cube. You should take a look at the description of the @> operator. This module is needed by the next one.
EarthDistance. You will find here information about earth_box and earth_distance. This module assumes that the earth is spherical, which is an approximation good enough for the majority of applications.

A test with a table consisting of 2.2 million rows taken from the Free World Cities Database gives me the following answer to the previous query (which is not exactly the same as yours):

"ru","andra-ata","Andra-Ata","24",,44.9509,46.3327
"ru","andratinskiy","Andratinskiy","24",,44.9509,46.3327
"ru","chernozemelskaya","Chernozemelskaya","24",,44.9821,46.0622
"ru","gayduk","Gayduk","24",,44.9578,46.5244
"ru","imeni beriya","Imeni Beriya","24",,45.0208,46.3906
"ru","imeni kirova","Imeni Kirova","24",,45.2836,46.4847
"ru","kumskiy","Kumskiy","24",,44.9821,46.0622
"ru","kumskoy","Kumskoy","24",,44.9821,46.0622
"ru","lopas","Lopas","17",,44.937,46.1833
"ru","pyatogo dekabrya","Pyatogo Dekabrya","24",,45.1858,46.1656
"ru","svetlyy erek","Svetlyy Erek","24",,45.0079,46.4408
"ru","ulan tuk","Ulan Tuk","24",,45.1542,46.1097

To have an "order of magnitude" idea about timings: pgAdmin III is telling me that the time to get this answer is 22 ms. (PostgreSQL 9.6.1 with "out-of-the-box" parameters, on a Mac with Mac OS 10.12, Core i7, SSD)

Related Solutions

Sql-server – SQL Server 2008 – Return records in chronological order, given a separation factor

Prior to SQL 2008, the most common solution was to use a UDF to calculate the great-circle distance between two points on a sphere. The Haversine formula is probably the most commonly used method.

Of course the Earth is not actually a perfect sphere, but this was considered "good enough" for most uses.

In SQL 2008, as you anticipated, such calculations are simplified and made more accurate by the introduction of the Geography and Geometry data types. Here's a brief sample of how you can use them to simplify distance calculations.

DECLARE @locations TABLE(locname VARCHAR(100), coord geography)
DECLARE @loc1 geography
DECLARE @loc2 geography


INSERT INTO @locations 
VALUES('HOME', geography::Point(-81.810194, 41.478156, 4326))   --Note: Lat, Long, SRID
                                                                --The 4326 is the SRID (spatial reference id) used by SQL as 
                                                                --a reference to the WGS 84 Standard. This is the same reference
                                                                --used by the GPS system
INSERT INTO @locations 
VALUES('WORK', geography::Point(-81.687771, 41.498227, 4326))

SELECT * FROM @locations

SELECT @loc1 = coord FROM @locations WHERE locname = 'HOME'
SELECT @loc2 = coord FROM @locations WHERE locname = 'WORK'

SELECT @loc1.STDistance(@loc2) * 3.2808399  --STDistance is in meters so we multiply to convert to feet 

SELECT @loc1.STIntersects(@loc2.STBuffer(300 / 3.2808399)) as isWithin300Ft  --This formula returns True when the point @loc1 intersects with the 300ft buffer zone around @loc2

The SRID is the key to the improved accuracy. The WGS 84 specification to which it refers includes a standardized coordinate system and a reference ellipsoid. In other words, it accounts for the non-spherical nature of the Earth, giving better results than a pure spherical Great Circle calculation.

If GIS accuracy is important to your work, this is the simplest way to implement it in SQL 2008.

Postgresql – Fast hamming distance queries in postgres

Well, I spent a while looking at writing a custom postgres C extension, and wound up just writting a Cython database wrapper that maintains a BK-tree structure in memory.

Basically, it maintains a in-memory copy of the phash values from the database, and all updates to the database are replayed into the BK-tree.

It's all up on github here. It also has a LOT of unit-tests.

Querying across a dataset of 10 million hash values for items with a distance of 4 results in touching ~0.25%-0.5% of the values in the tree, and takes ~100 ms.

Update:

Best Answer

Related Solutions

Sql-server – SQL Server 2008 – Return records in chronological order, given a separation factor

Postgresql – Fast hamming distance queries in postgres

Related Question