PostgreSQL – Why Index is Not Used for IN Expression with Unnested Array

I have a query that joins two tables but refuses to use use index unless I add a limit. The case is a bit complicated so best described by showing you.

I have two tables, users and user_latest_locations, that looks something like this except that the users table contain around 30 columns for different things.

CREATE TABLE users (
    id serial PRIMARY KEY,
    name character varying(50) NOT NULL,
    description character varying(1000) NOT NULL,
    gender gender NOT NULL,
    looking_for_gender gender NOT NULL,
    latest_location geometry,
    search_radius integer NOT NULL,
    search_min_age integer NOT NULL,
    search_max_age integer NOT NULL,
    email character varying(50),
    password character varying(60) NOT NULL,
    birthdate date NOT NULL,
    last_activity timestamp without time zone NOT NULL,
    updated_time timestamp without time zone NOT NULL,
    created_time timestamp without time zone NOT NULL,
    status status NOT NULL DEFAULT 'pending'::status,
    country_code integer,
    mobile_number character varying(32),
    hide_contacts boolean NOT NULL DEFAULT true,
    occupation character varying(100) NOT NULL DEFAULT ''::character varying,
    school character varying(100) NOT NULL DEFAULT ''::character varying,
    hometown character varying(100) NOT NULL DEFAULT ''::character varying,
    hangouts character varying(1000) NOT NULL DEFAULT ''::character varying,
    popularity double precision NOT NULL DEFAULT 0,
    job character varying(100) NOT NULL DEFAULT ''::character varying,
    preview_push_message boolean NOT NULL DEFAULT true,
    region_ids integer[],
    intent intent NOT NULL DEFAULT 'fate'::intent,
    hide_mutual_contacts boolean NOT NULL DEFAULT false
);

CREATE UNIQUE INDEX users_mobile_idx ON users (mobile_number, country_code);

CREATE TABLE user_latest_locations (
  user_id integer NOT NULL PRIMARY KEY,
  last_activity timestamp without time zone NOT NULL,
  latest_location geometry
);

Both tables contain one row for every user, around 1 million in total.
The query looks like this:

SELECT * FROM users u
  LEFT JOIN user_latest_locations ull 
  ON ull.user_id = u.id
WHERE mobile_number IN (
  SELECT unnest('{1}'::text[]) as number 
  UNION 
  SELECT unnest('{1}'::text[]) as number)

And the resulting query plan looks like this:

Hash Right Join  (cost=24084.22..82388.58 rows=453489 width=306) (actual time=153.343..1123.307 rows=10 loops=1)
  Hash Cond: (ull.user_id = u.id)
  ->  Seq Scan on user_latest_locations ull  (cost=0.00..17549.25 rows=950725 width=44) (actual time=0.010..471.724 rows=949813 loops=1)
  ->  Hash  (cost=2472.61..2472.61 rows=453489 width=262) (actual time=0.527..0.527 rows=10 loops=1)
        Buckets: 16384  Batches: 64  Memory Usage: 129kB
        ->  Nested Loop  (cost=3.94..2472.61 rows=453489 width=262) (actual time=0.111..0.218 rows=10 loops=1)
              ->  HashAggregate  (cost=3.51..5.51 rows=200 width=0) (actual time=0.057..0.059 rows=1 loops=1)
                    Group Key: (unnest('{1}'::text[]))
                    ->  Append  (cost=0.00..3.01 rows=200 width=0) (actual time=0.023..0.040 rows=2 loops=1)
                          ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.019..0.022 rows=1 loops=1)
                          ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.004..0.006 rows=1 loops=1)
              ->  Index Scan using users_mobile_idx on users u  (cost=0.42..12.31 rows=2 width=262) (actual time=0.042..0.118 rows=10 loops=1)
                    Index Cond: ((mobile_number)::text = (unnest('{1}'::text[])))
Planning time: 1.310 ms
Execution time: 1123.851 ms

Why does the query do Seq Scan on user_latest_locations? Using its primary key would be much faster? If I add a LIMIT like below to the union it start using the index as expected.

SELECT * FROM users u
  LEFT JOIN user_latest_locations ull 
  ON ull.user_id = u.id
WHERE mobile_number IN (
  SELECT unnest('{1}'::text[]) as number 
  UNION 
  SELECT unnest('{1}'::text[]) as number limit 100)

The problem is that this is a function that take a array of mobile numbers which can be of varying length, it would be quite ugly to have a arbitrary limit inside the function.

I should also add that the stats for this table should indicate that its mostly unique values:

 n_distinct | correlation 
------------+-------------
         -1 |   0.0183006

The behaviour is the same on both PostgreSQL 9.3 and 9.5. Analyzing the tables doesn't help.

Best Answer

The planner doesn't look inside your unnest operations and see that there is only one 1 value in the arrays. It makes its standard assumption that unknown arrays will have 100 elements each, or 200 for the union. So it thinks there will be a lot (453489) of users which meet the criteria, and that it will be faster to read the entire user_latest_locations and a hash join, rather than doing 453489 individual index lookups.

One possible solution is to do the unnest...union...unnest query separately and store the results into an array variable, and then pass that array to the main query. This way the planner is more likely to see the true size of the array.

Best Answer

Related Solutions

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

PostgreSQL Optimization – Explain Plan with Giant Gaps Between Operations

Related Question