Postgresql – SELECT row based on range from columns in second table

join;postgresqlquery

I want to SELECT all rows from first_table INNER JOIN second_table for values where the column in first_table fall within range of values from two columns in second_table.

As example, I have created these two tables

First table:

CREATE TABLE location (loc_id int, gpstime int, lat_start float, lat_end float);

Second table:

CREATE TABLE segments( segment_id int, lat float,foot float,bike float,bus float,car float,metro float);

Then insert these values:

INSERT INTO location
SELECT 1, 12345, 41.10, 41.17 UNION ALL
SELECT 2, 99999, 42.00, 42.12 UNION ALL
SELECT 3,12346, 41.35, 41.30 

INSERT INTO segments
SELECT 1, 41.15, 0, 0, 0.95, 0.025, 0 UNION ALL
SELECT 2, 42.19, 0.95, 0, 0, 0, 0.025 UNION ALL
SELECT 3, 41.33,  0,0, 0, 0, 1

The result set I want should return all rows for which value of column lat in segments table is a value that exists between lat_start and lat_end columns of location table.

So I write the following query:

SELECT gpstime AS timestamp, lat_start,  lat_end,
CASE 
    WHEN GREATEST(car, bike, bus, metro, foot) = car THEN 'car' 
    WHEN GREATEST(car, bike, bus, metro, foot) =  bike THEN 'bike' 
    WHEN GREATEST(car, bike, bus, metro, foot) =  bus THEN 'bus' 
    WHEN GREATEST(car, bike, bus, metro, foot) =  metro THEN 'metro' 
    WHEN GREATEST(car, bike, bus, metro, foot) =  foot THEN 'walking' 
ELSE 'na' END AS mode
FROM location 
INNER JOIN segments
ON  lat BETWEEN lat_start AND lat_end;

Problem: The BETWEEN key word returns TRUE only when check-value fall in range (min, max) and does not evaluate to (max, min). Clearly, the value of lat in the "last" row of segments (41.33) is a value that fall within the range of (lat_start, lat_end) of location (41.30-41.35). However, the BETWEEN clause excludes this row, as per returned the table below:

+-----------+-----------+---------+------+
| timestamp | lat_start | lat_end | mode |
+-----------+-----------+---------+------+
|     12345 | 41.1      | 41.17   | bus  |
+-----------+-----------+---------+------+

How do I ensure all rows satisfying the criteria are returned?

Additional details

I posted this question yesterday, and following David answer, I was glad everything seems to be fixed. However, working on my database looks like there is still an error in the query as yet some rows are still missing. I reproduce minimal scenario of my work once again:

Table 1:
CREATE TABLE location (id int, lat float, lon float, gpstime int);
Table 2:
CREATE TABLE trips( trip_id int, lat_start float, lat_end float,
lon_start float, lon_end float,
travel_mode text);

Add rows:

INSERT INTO location
SELECT 1, 41.2, 8.3, 11111 UNION ALL   --both lat and lon in the interval
SELECT 2, 43.3, 8.5, 22222 UNION ALL   --only lat meets criteria
SELECT 3, 44.0, 8.5,  33333            --only lon meets criteria


INSERT INTO trips
SELECT 1, 41.1, 41.9, 8.0, 8.8, 'foot' UNION ALL  
SELECT 2, 43.1, 43.5, 9.5, 8.9, 'bus' UNION ALL
SELECT 3, 42.1, 42.8, 8.7, 8.0, 'metro'

And then the query, using BETWEEN SYMMETRIC:

SELECT trip_id, gpstime AS timestamp, lat, lon,  travel_mode
FROM location 
INNER JOIN trips 
    ON (lat BETWEEN SYMMETRIC lat_start AND lat_end)  
    AND (lon BETWEEN SYMMETRIC lon_start AND lon_end)

    

Results:

+---------+-----------+------+-----+-------------+
| trip_id | timestamp | lat  | lon | travel_mode |
+---------+-----------+------+-----+-------------+
|       1 |     11111 | 41.2 | 8.3 | foot        |
+---------+-----------+------+-----+-------------+

Yet, row 3 of location table is not included in the result set. What I am missing here?

Note: Using BETWEEN SYMMETRIC doesn't solve the problem yet. It appears value-checking is only applied to lat column and NONE to lon.

Db<>fiddle

I share with you, this db_fiddle I created to test the query. Cleary, all rows of location table should be returned as rows 2 and 3 meet one criteria either.

Best Answer

When writing the answer below I wasn't aware of postgres' BETWEEN SYMETRIC, since pointed out by a_horse_with_no_name. That would be the simplest solution and I assume would be as easy for the query planner & runner to use indexes to serve as just BETWEEN.

In my defense, I'm mainly a SQL Server person... The longer answer is hopefully still useful for others in a similar position but not using prosgres, or where cross-DB compatibility is a requirement.


The ends of a BETWEEN comparison are always assumed to be low and higher, in that order. In most cases X BETWEEN Y AND Z it is simple syntactic sugar for X>=Y AND X<=Z and 41.33>=41.35 AND 41.33<=41.30 obviously isn't true.

You could use CASE to swap the values:

ON  lat BETWEEN 
        (CASE WHEN lat_start<lat_end THEN lat_start ELSE lat_end END)
        AND
        (CASE WHEN lat_start>lat_end THEN lat_start ELSE lat_end END)

or the less verbose version suggested by ypercube™ in the comments:

ON (lat BETWEEN lat_start AND lat_end  ) 
OR (lat BETWEEN lat_end   AND lat_start)

but both of those would be inefficient because your join predicate is no longer saragble so will not result in an indexed search. If the number of rows being considered in each case is small because of other filters then this will not be an issue.

Another option is to UNION two comparisons:

SELECT <stuff>
FROM   location 
INNER JOIN 
       segments
       ON lat BETWEEN lat_start AND lat_end
WHERE  lat_start <= lat_end
UNION ALL
SELECT <stuff>
FROM   location 
INNER JOIN 
       segments
       ON lat BETWEEN lat_end AND lat_start
WHERE  lat_start > lat_end
;

(here the extra WHERE in each half stops you getting two rows for instances when lat_start = lat_end, so we can use the more efficient UNION ALL instead of plain UNION without worrying about duplicates in those cases)

This means you get indexed lookups for both queries instead of a scan of segments.
(Technically it would be possible for a query planner to decompose ypercube™'s example into the two lookups like the UNION example, but I'm not aware of a query planner that does this - there is a limit to how much thinking a QP can do before it starts taking too much time planning in the general case, compared to the time spent running the result.)

You could change how you are storing your data and insist that the line/area is always defined bottom-left to top-right (so you have lat_left and lat_right instead of lat_start and lat_end) then your original query will word. If direction of travel is important then you can add that as a separate boolean value: 0 for left-to-right, 1 for right-to-left, or similar.

You might want to look into extensions that exist for postgres specifically designed for geometric (including geological, lat+long based) data types and related queries: https://postgis.net
They may do a chunk of your work for you, after the initial learning curve, especially if you start needing to consider more complex shapes.