Postgresql – Expression index on a citext column ignored, why

citextindexpostgresql

Running on RDS with about 32M rows.

PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

Also testing locally on macOS with about 8M rows.

PostgreSQL 11.5 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit

I've got a column named old_value that's of type citext. I asked about this already, but posted way to many of my discovery steps along the way. Here's a boiled down version that I'm hoping gets to the point.

Background

I've got a field change log table named record_changes_log_detail with 32M rows and growing that includes a citext field named old_value.

The data is very skeweed. Most values are less than a dozen characters, some are more than 5,000.

Postgres chokes on large values with an error about B-tree entries being limited to 2172 characters. So I believe that for a B-tree, I need to substring the source value.

My users primary interest is in an = search, a starts-with search, and, sometimes, a contains-this-substring search. So = string% and %string%

Goals

Create an index that supports those searches that the planner uses.

Tried and failed

A straight B-tree fails to build, in some cases, because of long values.

An expression B-tree like this builds, but is not used

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON  record_changes_log_detail 
    USING btree (substring(old_value,1,1024));

Adding text_pattern_opts does not help.

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON  record_changes_log_detail 
    USING btree (substring(old_value,1,1024) text_pattern_opts);

Tried and works partially

A hash index works, but only for equality. (Like it says on the tin.)

This is the closest I've gotten to success:

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail 
    USING btree (old_value citext_pattern_ops);

This works for quality, but not for LIKE. The release notes for PG 11 say it should work for LIKE:

https://www.postgresql.org/docs/11/release-11.html

By "work" I mean "the index is used."

I was unable to substring succesfully with this approach.

What do people do in this situation with citext fields?

Best Answer

It is unusual to index such a long column entirely.

Three ideas:

  1. Modify the query like this:

    WHERE substring(old_value, 1, 100) LIKE substring(pattern, 1, 100)
      AND old_value LIKE pattern
    

    (pattern here would be the pattern string, something like 'string%'.)

    Then a b-tree index on substring(old_value, 1, 100) can be used (if the pattern doesn't start with a wildcard character of course).

  2. Depending on the exact requirements (are you searching complete words or word prefixes in a natural language text or not), full text search may be a good solution.

  3. Another option are of course trigram indexes:

    CREATE INDEX ON record_changes_log_detail USING gin (old_value gin_trgm_ops);
    

    This requires the pg_trgm extension to be installed.

    Such an index will work also for search patterns that start with a wildcard. For good performance, enforce a minimum length on the search string.