PostgreSQL Performance – Sequential Scan vs Index Scan

optimizationperformancepostgresqlquery-performance

My use case is I am trying to select rows based on a list of known text values. For example, I have known text values as 'Frasier' and 'Cheers' and I want to select some rows that match the values.

EXPLAIN
    WITH lcaseid AS (
        SELECT lower(icaseid)
        FROM (
                 VALUES ('Frasier'),
                        ('Big brother')) AS i (icaseid)
    )
    SELECT *
    FROM products.catalog
    WHERE lower(id)
              in (
              SELECT *
              FROM lcaseid);
Hash Semi Join  (cost=0.10..12.76 rows=2 width=587)
  Hash Cond: (lower(catalog.id) = lcaseid.lower)
  CTE lcaseid
    ->  Values Scan on ""*VALUES*""  (cost=0.00..0.03 rows=2 width=32)
    ->  Seq Scan on catalog  (cost=0.00..12.29 rows=129 width=587)
            ->  Hash  (cost=0.04..0.04 rows=2 width=32)
            ->  CTE Scan on lcaseid  (cost=0.00..0.04 rows=2 width=32)

Why is Postgres doing Sequential Scan here? There is an index on lower(id):

CREATE UNIQUE INDEX lower_case_id ON products.catalog ((lower(id)));

When I do this instead, Postgres shows an Index Scan:

EXPLAIN
    SELECT
        *
    FROM
        products.catalog
    WHERE
            lower(id)
            in('frasier');
Index Scan using lower_case_id on catalog  (cost=0.14..8.16 rows=1 width=587)
  Index Cond: (lower(id) = 'frasier'::text)

The problem with the second approach is that I have to have the text values to be in lowercase because I can use them in the in clause. The first approach solves this problem by lower-casing when creating the temporary table lcaseid.

My guess is the temporary table is to blame for the sequential scan? Because Postgres does not have statistics on the temporary table, it simply does a sequential scan.

Also how should I go about solving the problem of selecting based on lowercase text values optimally?


Table definition:

CREATE TABLE "products"."catalog" (
    "id" text NOT NULL,
    "is_original" bool NOT NULL,
    "image_url_1" text NOT NULL,
    "image_url_2" text NOT NULL,
    "image_url_3" text NOT NULL,
    "image_url_4" text NOT NULL,
    PRIMARY KEY ("id")
);

CREATE UNIQUE INDEX lower_case_id ON products.catalog ((lower(id)));

Postgres version: 9.6.12

Best Answer

Your test table is very small:

rows=129 width=587

Around 10 data pages of 8 kB. Your index occupies 2 data pages most likely, incl. 1 meta page. (That's unless either is bloated with dead tuples or something, which decreases general performance while typically favoring the use of indexes.)

Postgres estimates that it pays to involve the index for a single input value, but not for two or more, for which case more and more data pages have to be read anyway. The overhead of involving the index won't pay any more. That may explain the different query plans. Hardly important, though, for the small table where either way is extremely fast, as was pointed out in comments already.

The important point of interest here: Postgres will use the index once the table gets big enough. Missing statistics for the implicit temporary table from the CTE won't stand in the way. While you are right that Postgres does not have detailed statistics there, like a MCV list (most common values) etc., it does know and factor in the number of rows - as can be seen in the query plan you posted:

CTE Scan on lcaseid (cost=0.00..0.04 rows=2 width=32)

And statistics on the table are typically overwhelmingly more important anyway.
If you have a case where detailed statistics on the input data will make a difference, you can create a temporary table instead of the CTE and ANALYZE it manually. (But I don't see that here.) See:

Either way, if your input values won't contain duplicates, you can simplify:

SELECT c.*
FROM   products.catalog c
JOIN  (
   VALUES ('Frasier'),
          ('Big brother')
   ) i(icaseid) ON lower(i.icaseid) = lower(id);

A bit faster because it gets rid of both the CTE and the IN which add extra cost. And less noise.
I don't see anything else to improve for either query or index. If you only need a small selection of columns from a much wider row, there might be potential for an index-only scan. But while you SELECT *, there is not.

There is the additional module citext, which might have useful features for you. But I don't generally recommend it, due to side effects and limitations.