From your query plans, it looks like you're comparing ints to ints in the first query plan, and int to numeric in the second plan.
Your first comparison:
Index Cond: (("timestamp" >= 1431100800) AND ("timestamp" <= 1431108000))
and
timestamp >= 1431100800 and timestamp <= 1431108000
In the second query, it's numeric values:
Filter: ((numvalues[1] IS NOT NULL) AND (("timestamp")::numeric >= 1431100800.00) AND (("timestamp")::numeric <= 1431108000.00))
and
timestamp >= 1431093600.00 and timestamp <= 1431100800.00
Casting to numeric causes the index to be ignored in favor of a sequential scan.
You can see this with a very simple example, set up below:
CREATE TABLE t2 (a int);
CREATE INDEX t2_a_idx ON t2(a);
INSERT INTO t2 (a) SELECT i FROM generate_series(1,1000000) AS i;
VACUUM ANALYZE VERBOSE t2;
My first query plan looks like this:
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000;
Index Only Scan using t2_a_idx on t2 (cost=0.42..7134.65 rows=250413 width=4)
(actual time=0.019..29.926 rows=250000 loops=1)
Index Cond: (a > 750000)
Heap Fetches: 0
Planning time: 0.137 ms
Execution time: 39.114 ms
(5 rows)
Time: 39.540 ms
While a second query using numerics looks like this:
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000.00;
Seq Scan on t2 (cost=0.00..19425.00 rows=333333 width=4) (actual time=122.803..175.326 rows=250000 loops=1)
Filter: ((a)::numeric > 750000.00)
Rows Removed by Filter: 750000
Planning time: 0.058 ms
Execution time: 184.194 ms
(5 rows)
Time: 184.487 ms
In the second instance, the index is ignored in favor of a sequential scan because of the cast to numeric - which looks like exactly what's happening in your two examples.
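To confirm the cast is the culprit, you can cast the literal back to integer against the same toy table t2 set up above (a minimal check; the exact timings will differ on your machine). The planner should then go back to the index:

```sql
-- Casting the numeric literal back to int lets the planner use the index again
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000.00::int;
-- or simply write an integer literal in the first place:
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000;
```

Both forms should produce the Index Only Scan plan from the first example.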
One last aside: you might also be able to speed your query up by running a SET statement before executing it:
SET work_mem = '2GB';
That is, if your server can handle it. Your sorts are spilling to disk, as noted in this line of your query plan:
Sort Method: external merge Disk: 1387704kB
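If you'd rather not raise work_mem for the whole session, a sketch using SET LOCAL to confine the setting to one transaction (substitute your actual query):

```sql
BEGIN;
SET LOCAL work_mem = '2GB';  -- only affects the current transaction
SELECT ...;                  -- your query here
COMMIT;
```

The setting reverts automatically at COMMIT or ROLLBACK, so nothing else on the connection is affected.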
Hope this helps. =)
Your added comment is on the right track already:
I further found out that the inefficient index is only used if the
search query (foo in my example) appears in pg_stats.most_common_vals
for this column. So I assume this skews the estimated costs into the
wrong direction. Any ideas how to fix this?
If 'foo' is a very common value, Postgres expects it to be faster to just read the b-tree index sequentially and skip rows that do not match. The estimate is also based on cost settings and the expected selectivity of predicates. There are multiple entry points for skewed estimates:
- The row count in the statistics is off.
- Selectivity estimates for predicates are off.
- The effect of combining multiple predicates is misjudged.
- Cost settings are off.
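To check the first point - whether the row count in the statistics is off - you can compare the planner's estimate in pg_class against an actual count (users is the table from your query plan):

```sql
SELECT reltuples::bigint AS estimated_rows
     , (SELECT count(*) FROM users) AS actual_rows
FROM   pg_class
WHERE  oid = 'users'::regclass;
```

A large discrepancy suggests ANALYZE hasn't run recently enough (or autovacuum is falling behind).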
Selectivity of predicates and combinations thereof
Your most important problem seems to be here:
(cost=0.42..84,314.74 rows=2,088 width=58) (actual
time=2,363.900..2,363.904 rows=10 loops=1)
Postgres expects 209 times as many rows as are actually returned. explain.depesz.com can help audit a query plan: http://explain.depesz.com/s/53E
You may be able to get more accurate estimates by increasing the statistics target for the indexed expressions. Postgres not only gathers statistics for table columns, it does the same for indexed expressions (not for plain index columns, for which table statistics are available already).
You can check with:
SELECT * FROM pg_stats WHERE tablename = 'users_first_name_gin';
Basic row counts and settings are in pg_class and pg_attribute:
SELECT *
FROM pg_attribute
WHERE attrelid = 'users_last_name_gin'::regclass;
SELECT *
FROM pg_class
WHERE oid = 'users_last_name_gin'::regclass;
Indexes are treated as special tables internally. This should make it less surprising that you can use ALTER TABLE on an index:
ALTER TABLE users_last_name_gin ALTER COLUMN f_unaccent SET STATISTICS 1000;
Use this to increase the sample size for calculating statistics for the index column. Then run ANALYZE users to update statistics.
You can't provide explicit column names for index entries; names are chosen automatically. You can look them up with the query on pg_attribute above. The column name f_unaccent is derived from the name of the function used.
The default statistics target is 100; the allowed range is 0 - 10000. For very big tables, 100 is often not enough to get reasonable estimates. Set it to 1000 (for example) for the index expression to get better estimates.
Workaround
As dezso commented, you can get the alternative query plan by encapsulating the original form in a CTE (which acts as an optimization barrier in Postgres) - before applying ORDER BY and LIMIT in the outer SELECT:
WITH cte AS (
SELECT *, count(*) OVER() AS filtered_count
FROM users
WHERE (f_unaccent("users"."first_name") ILIKE f_unaccent('%foo%') OR
f_unaccent("users"."last_name") ILIKE f_unaccent('%foo%') OR
f_unaccent("users"."club_or_hometown") ILIKE f_unaccent('%foo%'))
)
SELECT *
FROM cte
ORDER BY first_name
LIMIT 10;
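A note for readers on Postgres 12 or later: plain CTEs are no longer an unconditional optimization barrier there and can be inlined into the outer query. To keep the barrier behavior, add the MATERIALIZED keyword:

```sql
WITH cte AS MATERIALIZED (   -- MATERIALIZED forces the barrier on Postgres 12+
   SELECT *, count(*) OVER() AS filtered_count
   FROM users
   WHERE ...                 -- same predicates as above
)
SELECT *
FROM cte
ORDER BY first_name
LIMIT 10;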
Alternative index
Since you commented:
I'm always searching all three columns
A single index for all three would be cheaper. GIN indexes can be multicolumn indexes. The documentation:
Currently, only the B-tree, GiST and GIN index types support
multicolumn indexes.
And:
A multicolumn GIN index can be used with query conditions that involve
any subset of the index's columns. Unlike B-tree or GiST, index search
effectiveness is the same regardless of which index column(s) the
query conditions use.
So:
CREATE INDEX big_unaccent_big_gin_idx ON users USING gin (
f_unaccent(first_name) gin_trgm_ops
, f_unaccent(last_name) gin_trgm_ops
, f_unaccent(club_or_hometown) gin_trgm_ops);
This would reduce the triple overhead per index entry to just one. Should be faster overall. Or, faster yet, concatenate all three columns into a single string. I am adding a space as separator to avoid false positives. Use any character as separator that is not going to show up in the search expression:
CREATE INDEX big_unaccent_big_gin_idx ON users USING gin (
f_unaccent(concat_ws(' ', first_name, last_name, club_or_hometown)) gin_trgm_ops);
If all columns are NOT NULL, you can use plain concatenation instead:
first_name || ' ' || last_name || ' ' || club_or_hometown
Be sure to use the same expression in your query:
WHERE f_unaccent(concat_ws(' ', first_name, last_name, club_or_hometown)) ILIKE '%foo%'
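Putting it together, the full query against the concatenated expression would then look something like this (same ORDER BY / LIMIT as in the CTE workaround above):

```sql
SELECT *, count(*) OVER() AS filtered_count
FROM   users
WHERE  f_unaccent(concat_ws(' ', first_name, last_name, club_or_hometown))
       ILIKE f_unaccent('%foo%')
ORDER  BY first_name
LIMIT  10;
```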
Set STATISTICS to 1000 or more as demonstrated above and run ANALYZE before you test again. Be sure to run the query multiple times to compare warm cache to warm cache.
Besides the smaller index and faster computation, the main benefit for your case could be that a single predicate is less susceptible to errors in the cost estimation. Combining multiple predicates adds errors to the calculation.
Force query plan
If all else fails, you can force your preferred query plan like I suggested in the comment by disabling index scans temporarily. Remember, this is an evil hack that may backfire if the underlying data distribution changes or if you upgrade to the next Postgres version.
Use SET LOCAL to confine the effect to the transaction, and wrap the whole thing in an explicit transaction:
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT ... -- your query here
COMMIT;
Best Answer
A COLLSCAN operation is a collection scan, which indicates that no index is being used and the collection is being iterated in natural order.
If your query had no criteria but requested a sort by _id, the _id index would be the best candidate to return results in the expected order. In MongoDB 3.0 I would expect this to be indicated as an IXSCAN rather than a COLLSCAN.
As you suspected, the _id index is just a normal index which is a candidate for being selected by the query planner when querying and/or sorting by _id.
The explain() results show relevant query planning information, but by default only include a summary of the winning plan. You can see more detailed information (such as the candidate plans evaluated) using .explain(true).
MongoDB 3.0's explain() actually has 3 verbosity modes:
- queryPlanner: the default
- executionStats: more info on the winning plan
- allPlansExecution (aka true): information on all plans
There are some other uses of the _id index which aren't exactly hidden but may not be obvious:
- replication requires a unique _id index (historical note: in versions prior to MongoDB 2.2 it was possible to create and replicate capped collections without an _id index).
- a snapshot cursor iterates a collection in _id order to ensure a query will not return a document multiple times.