PostgreSQL – Cities/Countries autocomplete query optimization

Tags: performance, postgresql

I have Cities and Countries tables, based on Geonames data.
I'm trying to implement an autocomplete function, but so far it doesn't work as I intend it to.

CREATE INDEX gin_pgtrm_testcity_big_index ON testcity_big
  USING gin ((name || ' ' || asciiname || ' ' || alternatenames) COLLATE pg_catalog."default" gin_trgm_ops);

CREATE INDEX gin_pgtrm_tescountry_big_index ON testcountry_big USING gin ((name) COLLATE pg_catalog."default" gin_trgm_ops);

SELECT
    country.name AS country,
    city.name AS city,
    city.population AS city_population,
    city.fcode AS fscode,
    
    similarity(coalesce(country.name,'')
                || coalesce(city.name,'')
                || coalesce(city.asciiname,'')
    , '%ribe dk%') AS rank
    
FROM testcountry_big country
JOIN testcity_big city ON city.country = country.iso_alpha2

WHERE similarity(coalesce(country.name,'') 
                    || coalesce(city.name,'') 
                    || coalesce(city.asciiname,'')
     ,'%ribe dk%') > 0.15
        
AND fcode ~ '\yPPLC\y|\yPPLA\y|\yPPLA2\y|\yPPL\y|\yHMSD\y'
ORDER BY rank DESC
LIMIT 5;

EXPLAIN (ANALYZE, BUFFERS):

Limit  (cost=12037.76..12038.34 rows=5 width=37) (actual time=533.056..534.754 rows=5 loops=1)
  Buffers: shared hit=4824
  ->  Gather Merge  (cost=12037.76..17173.78 rows=44020 width=37) (actual time=533.054..534.748 rows=5 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=4824
        ->  Sort  (cost=11037.73..11092.76 rows=22010 width=37) (actual time=528.410..528.413 rows=4 loops=3)
              Sort Key: (similarity((((COALESCE(country.name, ''::character varying))::text || COALESCE(city.name, ''::text)) || COALESCE(city.asciiname, ''::text)), '%ribe dk%'::text)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
              Buffers: shared hit=4824
              ->  Hash Join  (cost=10.67..10672.16 rows=22010 width=37) (actual time=1.219..524.476 rows=7239 loops=3)
                    Hash Cond: (city.country = (country.iso_alpha2)::text)
                    Join Filter: (similarity((((COALESCE(country.name, ''::character varying))::text || COALESCE(city.name, ''::text)) || COALESCE(city.asciiname, ''::text)), '%ribe dk%'::text) > '0'::double precision)
                    Rows Removed by Join Filter: 36625
                    Buffers: shared hit=4722
                    ->  Parallel Seq Scan on testcity_big city  (cost=0.00..5633.16 rows=54502 width=36) (actual time=0.571..162.228 rows=43864 loops=3)
                          Filter: (fcode ~ '\yPPLC\y|\yPPLA\y|\yPPLA2\y|\yPPL\y|\yHMSD\y'::text)
                          Rows Removed by Filter: 20275
                          Buffers: shared hit=4631
                    ->  Hash  (cost=7.52..7.52 rows=252 width=13) (actual time=0.328..0.329 rows=252 loops=3)
                          Buckets: 1024  Batches: 1  Memory Usage: 20kB
                          Buffers: shared hit=15
                          ->  Seq Scan on testcountry_big country  (cost=0.00..7.52 rows=252 width=13) (actual time=0.036..0.135 rows=252 loops=3)
                                Buffers: shared hit=15
Planning Time: 0.279 ms
Execution Time: 534.819 ms

1. I cannot return a country on its own. My current query retrieves cities only: for example, for "New York United States" I get 4 different cities, but I never get the country United States, and I can't see how to make it work for all possible combinations: cities, countries, and cities+countries.

2. This query is too slow for an autocomplete function – around 0.5 seconds.

(a) should I add a Materialized View to help boost the performance?

(b) could it be so slow simply because I'm using a local machine (laptop) and the performance on a real server is expected to be better?

** I also tried working with tsvector here, but:

  1. unfortunately it only works on whole words, not substrings
  2. the performance seemed worse than with trigrams
  3. the similarity function using trigrams lets users be more "loose" with their queries and still get results even when they make a typo.

Best Answer

I can't understand how to make it work for all possible combinations: cities, countries , cities+countries.

Run separate queries, and combine the results with a UNION or UNION ALL.
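
As a sketch of what that could look like (table and column names are taken from the question; 'ribe dk' stands in for the user's input, and the `%` operator assumes the pg_trgm extension is installed):

```sql
-- One branch per entity type, combined and re-ranked as a single result set.
(SELECT name AS label, 'country' AS kind, similarity(name, 'ribe dk') AS rank
   FROM testcountry_big
  WHERE name % 'ribe dk')
UNION ALL
(SELECT name AS label, 'city' AS kind, similarity(name, 'ribe dk') AS rank
   FROM testcity_big
  WHERE name % 'ribe dk')
ORDER BY rank DESC
LIMIT 5;
```

UNION ALL is the cheaper choice here: the branches return different kinds of rows, so there are no duplicates for plain UNION to remove.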

This query performs too slow for an autocomplete function - around 0.5 seconds.

See https://dba.stackexchange.com/tags/postgresql-performance/info for asking effective performance questions. Especially EXPLAIN (ANALYZE, BUFFERS).

Your query can't use your pg_trgm indexes, for multiple reasons. The similarity() function is not directly indexable; for a similarity query to be indexable, it would have to use the % operator or <-> (and the latter only with GiST, not GIN). Also, an expression index can only be used when the expression in the query matches the expression used to define the index, which is not the case in your query. Furthermore, it can't be the case, as the query expression pulls columns from multiple tables, and an index cannot span tables.
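
For illustration, the indexable forms against the single-column country index from the question might look like this (a sketch; the 0.15 threshold mirrors the one in your WHERE clause):

```sql
-- The % operator is indexable by a GIN or GiST trgm index;
-- its cutoff is controlled by pg_trgm.similarity_threshold.
SET pg_trgm.similarity_threshold = 0.15;

SELECT name, similarity(name, 'ribe') AS rank
  FROM testcountry_big
 WHERE name % 'ribe'        -- can use gin_pgtrm_tescountry_big_index
 ORDER BY name <-> 'ribe'   -- distance ordering; index-assisted only with GiST
 LIMIT 5;
```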

(a) should I add a Materialized View to help boost the performance?

If you want an index which spans columns from different tables, then that is how you would need to do it (or denormalize). For example:

CREATE MATERIALIZED VIEW city as 
 SELECT country.name AS country,
    city.name AS city,
    city.population AS city_population,
    city.fcode AS fscode,
    (COALESCE(country.name, ''::text) || COALESCE(city.name, ''::text)) || COALESCE(city.asciiname, ''::text) AS combined_name
   FROM testcountry_big country
     JOIN testcity_big city ON city.country = country.iso_alpha2
  WHERE city.fcode ~ '\yPPLC\y|\yPPLA\y|\yPPLA2\y|\yPPL\y|\yHMSD\y'::text;

create index on city using gin (combined_name gin_trgm_ops);

select country, city, city_population from city 
    where combined_name % 'ribe dk'
    order by  combined_name <-> 'ribe dk' limit 5;

That is one way you would do it if that is what you wanted to do, but I don't know if it is a good idea. Autocomplete and spell correction are generally different things. People rarely start typing from the middle of a word, so LIKE '%rib%' or its similarity variants are probably not going to offer the most relevant completions.
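
One operational caveat if you go the materialized view route: the view does not update itself, so you would need to refresh it whenever the underlying Geonames tables change. A sketch (it assumes the view definition also selects a unique key column, e.g. the city's Geonames geonameid, because REFRESH ... CONCURRENTLY requires a unique index):

```sql
-- Hypothetical: assumes geonameid was added to the view's SELECT list.
CREATE UNIQUE INDEX ON city (geonameid);

-- CONCURRENTLY avoids locking out autocomplete queries during the refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY city;
```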