Postgresql – Postgres doesn’t use expression index for group by

indexoptimizationperformancepostgresqlquery-performance

I'm doing a simple group by on a single column with a matching index and it works fine:

create index on contacts (home_city);
select home_city, count(*) 
from contacts 
group by home_city 
limit 50;

Takes 150 ms on an 8 million row table. That's plenty fast enough. EXPLAIN ANALYZE output:

"Limit  (cost=1000.46..9256.24 rows=50 width=17) (actual time=14.201..76.630 rows=50 loops=1)"
"  ->  Finalize GroupAggregate  (cost=1000.46..207395.17 rows=1250 width=17) (actual time=14.200..62.196 rows=50 loops=1)"
"        Group Key: home_city"
"        ->  Gather Merge  (cost=1000.46..207370.17 rows=2500 width=17) (actual time=14.100..76.557 rows=97 loops=1)"
"              Workers Planned: 2"
"              Workers Launched: 2"
"              ->  Partial GroupAggregate  (cost=0.43..206081.59 rows=1250 width=17) (actual time=0.897..56.260 rows=41 loops=3)"
"                    Group Key: home_city"
"                    ->  Parallel Index Only Scan using contacts_6_home_city_idx on contacts_6  (cost=0.43..188620.82 rows=3489654 width=9) (actual time=0.026..28.866 rows=105725 loops=3)"
"                          Heap Fetches: 0"
"Planning Time: 0.099 ms"
"Execution Time: 76.680 ms"

Now I want to do it with a case-insensitive index:

create index on contacts (lower(home_city));
select lower(home_city), count(*) 
from contacts 
group by lower(home_city) 
limit 50;

Takes 3-4 seconds. It appears to be using the wrong index:

"Limit  (cost=215869.40..215882.20 rows=50 width=40) (actual time=4253.233..4253.846 rows=50 loops=1)"
"  ->  Finalize GroupAggregate  (cost=215869.40..216175.66 rows=1197 width=40) (actual time=4253.232..4253.335 rows=50 loops=1)"
"        Group Key: (lower((home_city)::text))"
"        ->  Gather Merge  (cost=215869.40..216148.72 rows=2394 width=40) (actual time=4253.219..4253.796 rows=102 loops=1)"
"              Workers Planned: 2"
"              Workers Launched: 2"
"              ->  Sort  (cost=214869.38..214872.37 rows=1197 width=40) (actual time=4244.566..4244.630 rows=710 loops=3)"
"                    Sort Key: (lower((home_city)::text))"
"                    Sort Method: quicksort  Memory: 88kB"
"                    Worker 0:  Sort Method: quicksort  Memory: 89kB"
"                    Worker 1:  Sort Method: quicksort  Memory: 120kB"
"                    ->  Partial HashAggregate  (cost=214793.22..214808.18 rows=1197 width=40) (actual time=4241.397..4241.658 rows=1021 loops=3)"
"                          Group Key: lower((home_city)::text)"
"                          ->  Parallel Index Only Scan using contacts_6_home_city_idx on contacts_6  (cost=0.43..197344.95 rows=3489654 width=32) (actual time=0.059..2918.355 rows=2778720 loops=3)"
"                                Heap Fetches: 0"
"Planning Time: 0.118 ms"
"Execution Time: 4253.906 ms"

When I delete the index on plain home_city, it does a full table scan and takes 6 seconds.

How do I get it to use the right index? What am I missing here?

(I'm using Postgres 11.5).

Best Answer

There is this code in check_index_only in src/backend/optimizer/path/indxpath.c:

/*
 * check_index_only
 *      Determine whether an index-only scan is possible for this index.
 */
static bool
check_index_only(RelOptInfo *rel, IndexOptInfo *index)
{

[...]

    /*
     * Construct a bitmapset of columns that the index can return back in an
     * index-only scan.  If there are multiple index columns containing the
     * same attribute, all of them must be capable of returning the value,
     * since we might recheck operators on any of them.  (Potentially we could
     * be smarter about that, but it's such a weird situation that it doesn't
     * seem worth spending a lot of sweat on.)
     */
    for (i = 0; i < index->ncolumns; i++)
    {
        int         attno = index->indexkeys[i];

        /*
         * For the moment, we just ignore index expressions.  It might be nice
         * to do something with them, later.
         */
        if (attno == 0)
            continue;

So if I read that right, expressions are ignored when an index only scan is considered.

I don't think that's a matter of principle, it is just not implemented.

From PostgreSQL v12 on, you could use a case-insensitive ICU collation to get what you want:

CREATE COLLATION und_ci (
   LOCALE = 'und-u-ks-level2',
   PROVIDER = icu,
   DETERMINISTIC = FALSE
);

Then you'd have to change the column to use that collation:

ALTER TABLE contacts ALTER home_city TYPE text COLLATE und_ci;

This index can be used for case insensitive searches, and you can use it for an index only scan too.