Postgresql – Levenshtein function returning different result based on limit value

full-text-searchfunctionspostgresqlstringstring-searching

The Levenshtein function is not working as I would have expect. Is there anything that I minsunderstood?

Here is the query:

SELECT c0.id, c0.engine_type, c0.mpg, c0.kwh, c0.price, c0.make, c0.model, c0.vin, c0.inserted_at, c0.updated_at 
FROM cars AS c0 ORDER BY LEAST(levenshtein(c0.model, 'Camry'), levenshtein(c0.make, 'Toyota')) 
LIMIT 5

Executing this query will return me the following data:

5   "electric"      257 32288   "Toyota"    "Camry" "SW081452D50423138" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
20  "gasoline"  83      68851   "Toyota"    "Camry" "643VN327D4ZH04928" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
4   "gasoline"  74      74482   "Toyota"    "Corolla"   "1K48R780410S27945" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
10  "gasoline"  73      87040   "Dodge" "Ram"   "J22782VG240639409" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
3   "electric"      116 66560   "Audi"  "A5"    "94V5772ZB4BJ23179" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"

As you can see above, the first two matches are the Toyota Camrys; which I would expect from the query.
However, when I change the LIMIT attribute to 10 for instance,

SELECT c0.id, c0.engine_type, c0.mpg, c0.kwh, c0.price, c0.make, c0.model, c0.vin, c0.inserted_at, c0.updated_at 
FROM cars AS c0 ORDER BY LEAST(levenshtein(c0.model, 'Camry'), levenshtein(c0.make, 'Toyota')) 
LIMIT 10

I get a different result:

4   "gasoline"  74      74482   "Toyota"    "Corolla"   "1K48R780410S27945" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
5   "electric"      257 32288   "Toyota"    "Camry" "SW081452D50423138" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
20  "gasoline"  83      68851   "Toyota"    "Camry" "643VN327D4ZH04928" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
10  "gasoline"  73      87040   "Dodge" "Ram"   "J22782VG240639409" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
7   "electric"      274 41661   "Dodge" "Charger"   "FDND794KFW0179068" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
8   "gasoline"  57      42369   "BMW"   "M3"    "NS7V3N1VW5J508253" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
9   "electric"      214 15710   "BMW"   "X5"    "3VUFCG07ATW125829" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
11  "electric"      417 63167   "Nissan"    "Juke"  "6800ULHC7H0857158" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
12  "gasoline"  78      21059   "Lincoln"   "MKX"   "AFUCF3SUG6W287040" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"
13  "electric"      348 93954   "Lincoln"   "MKS"   "2L64A6Z18XR348145" "2020-10-06 14:48:27"   "2020-10-06 14:48:27"

Here above, for a reason beyond my understanding, the Toyota Corolla comes before the two Camrys. How come the Levenshtein distance is smaller for "Toyota Corolla" than "Toyota Camry" when I'm explictely searching for "Toyota Camry" and I limit the rows returned by the query to 10?

Any idea why?

Best Answer

The least() function returns the lesser of it's two arguments; both sets of rows have a perfect match for 'Toyota', which will have a distance of 0. So the ultimate evaluated ORDER BY value for each of the 3 rows is 0.

Adding additional ORDER BY expressions of each of the arguments to least() should do what you want: ORDER BY LEAST( levenshtein(c0.model, 'Camry'), levenshtein(c0.make, 'Toyota')), levenshtein(c0.model, 'Camry'), levenshtein(c0.model, 'Toyota') will return all Camrys first, then any non-Camry Toyotas, and then rows whose least() expression evaluated to something higher than 0.