Postgresql – How to simplify a nested SELECT with PostgreSQL arrays

postgresql

I'm trying to speed up and simplify a SQL query against an imported OpenStreetMap database (OSM). The database is stored in a PostgreSQL 9.2.4 server.

This OSM import features two particular tables, planet_osm_rels and planet_osm_ways. In the first table exists an relation for country boundaries, which I can extract by querying an hstore column, tags_hstore. The resulting members attribute then contains an text array giving me a bunch of information, including which ways are part of this relation. The ID of the particular ways are prefixed with w to indicate that it is a way ID, e.g. w23412. To get the actual nodes of the ways, I need to query the planet_osm_ways table with the IDs I obtained, minus the w, of course.

To summarize, I've got the following table structure:

   Table "public.planet_osm_rels"
   Column    |   Type   | Modifiers 
-------------+----------+-----------
 id          | bigint   | not null
 way_off     | smallint | 
 rel_off     | smallint | 
 parts       | bigint[] | 
 members     | text[]   | 
 tags        | text[]   | 
 pending     | boolean  | not null
 tags_hstore | hstore   | 
Indexes:
    "planet_osm_rels_pkey" PRIMARY KEY, btree (id)
    "planet_osm_rels_idx" btree (id) WHERE pending
    "planet_osm_rels_parts" gin (parts) WITH (fastupdate=off)
    "planet_osm_rels_tags_hstore_idx" gin (tags_hstore)

   Table "public.planet_osm_ways"
   Column    |   Type   | Modifiers 
-------------+----------+-----------
 id          | bigint   | not null
 nodes       | bigint[] | not null
 tags        | text[]   | 
 pending     | boolean  | not null
 tags_hstore | hstore   | 
Indexes:
    "planet_osm_ways_pkey" PRIMARY KEY, btree (id)
    "planet_osm_ways_idx" btree (id) WHERE pending
    "planet_osm_ways_nodes" gin (nodes) WITH (fastupdate=off)

I've come up with the following query:

SELECT  nodes
FROM    planet_osm_ways
WHERE   id IN (
      SELECT    trim(leading 'w' from unnest)::int
      FROM (
        SELECT  unnest(members)
        FROM    planet_osm_rels
        WHERE   (tags_hstore @> '"type"=>"boundary", "admin_level"=>"2", "name:en"=>"Germany"'))
      AS        unnest
      WHERE     unnest LIKE 'w%');

The query is, small wonder, quite slow. I know that I can (a) eliminate the members column by providing a link table and throw in some more indexes. However, I'd also like to optimize the query itself and at least remove one of the sub-queries, as the query plan is quite complex:

                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=299957.16..300008.23 rows=90957940 width=8)
   ->  HashAggregate  (cost=299957.15..299957.16 rows=1 width=32)
         ->  Subquery Scan on unnest  (cost=0.00..299954.76 rows=956 width=32)
               Filter: (unnest.unnest ~~ 'w%'::text)
               ->  Seq Scan on planet_osm_rels  (cost=0.00..297563.51 rows=191300 width=180)
                     Filter: ((tags)::hstore @> '"type"=>"boundary", "name:en"=>"Germany", "admin_level"=>"2"'::hstore)
   ->  Index Only Scan using planet_osm_ways_pkey on planet_osm_ways  (cost=0.01..51.06 rows=1 width=8)
         Index Cond: (id = (ltrim(unnest.unnest, 'w'::text))::integer)
(8 rows)

And the EXPLAIN ANALYZE:

                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=299957.16..299980.93 rows=39090200 width=1147) (actual time=18680.342..36216.571 rows=1266 loops=1)
   ->  HashAggregate  (cost=299957.15..299957.16 rows=1 width=32) (actual time=18606.686..18608.105 rows=1259 loops=1)
         ->  Subquery Scan on unnest  (cost=0.00..299954.76 rows=956 width=32) (actual time=468.391..18606.233 rows=1259 loops=1)
               Filter: (unnest.unnest ~~ 'w%'::text)
               Rows Removed by Filter: 1283
               ->  Seq Scan on planet_osm_rels  (cost=0.00..297563.51 rows=191300 width=180) (actual time=468.376..18605.288 rows=2542 loops=1)
                     Filter: ((tags)::hstore @> '"type"=>"boundary", "name:en"=>"Germany", "admin_level"=>"2"'::hstore)
                     Rows Removed by Filter: 1912651
   ->  Index Scan using planet_osm_line_pkey on planet_osm_line  (cost=0.01..23.73 rows=3 width=1155) (actual time=13.926..13.978 rows=1 loops=1259)
         Index Cond: (osm_id = (ltrim(unnest.unnest, 'w'::text))::bigint)
 Total runtime: 36217.277 ms

The number of rows returned isn't quite as high as to explain the long runtime:

 count 
-------
  1266

I cannot use SELECT unnest(members) AS unnested .... WHERE unnested LIKE 'w%' since the "unnested" part is unknown to the WHERE clause. Is there any better way to do it?

Best Answer

IN queries with huge sets are notoriously slow. It's often faster to use a JOIN instead:

SELECT nodes
FROM   planet_osm_ways
JOIN   (
   SELECT ltrim(member, 'w')::bigint AS id
   FROM  (
      SELECT unnest(members) AS member
      FROM   planet_osm_rels
      WHERE  (tags_hstore @> '"type"=>"boundary", "admin_level"=>"2", ...')
      ) u
   WHERE member LIKE 'w%'
   ) x USING (id);

But that's not the most important problem here. I wonder why the GIN index planet_osm_rels_tags_hstore_idx is not being used. Are you selecting large enough parts of the table planet_osm_rels to justify a sequential scan?

Oh, and id is type bigint. So cast to bigint instead of int for less friction.

If you can extract "way IDs" and save them redundantly in a separate column way_ids bigint[] in your table, your query would become quite a bit simpler and faster, with one less subquery level:

SELECT nodes
FROM   planet_osm_ways
JOIN   (
   SELECT unnest(way_ids) AS id
   FROM   planet_osm_rels
   WHERE  (tags_hstore @> '"type"=>"boundary", "admin_level"=>"2", ...')
   ) u USING (id);