This may be too late for the original poster, but for completeness, the way to achieve case-insensitive behaviour from PostgreSQL is to set a nondeterministic collation. This requires Postgres 12 or later.
Details are described in the docs here. Reproducing the relevant portion for completeness:
A collation is either deterministic or nondeterministic. A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence. Nondeterministic comparison may determine strings to be equal even if they consist of different bytes. Typical situations include case-insensitive comparison, accent-insensitive comparison, as well as comparison of strings in different Unicode normal forms. It is up to the collation provider to actually implement such insensitive comparisons; the deterministic flag only determines whether ties are to be broken using bytewise comparison. See also Unicode Technical Standard 10 for more information on the terminology.
To create a nondeterministic collation, specify the property deterministic = false to CREATE COLLATION, for example:
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
This example would use the standard Unicode collation in a nondeterministic way. In particular, this would allow strings in different normal forms to be compared correctly. More interesting examples make use of the ICU customization facilities explained above. For example:
CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
All standard and predefined collations are deterministic, all user-defined collations are deterministic by default. While nondeterministic collations give a more “correct” behavior, especially when considering the full power of Unicode and its many special cases, they also have some drawbacks. Foremost, their use leads to a performance penalty. Also, certain operations are not possible with nondeterministic collations, such as pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted.
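For illustration, a minimal sketch using the case_insensitive collation defined above (the table and data are made up for this example):

CREATE TABLE users (name text COLLATE case_insensitive);
INSERT INTO users VALUES ('Alice'), ('BOB');

-- Matches 'Alice' despite the different casing:
SELECT * FROM users WHERE name = 'ALICE';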
You have a lot of indexes, and my first impulse would be to check whether all of them are actually used and remove the unused ones.
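A sketch of one way to find candidates, using the statistics view pg_stat_user_indexes (counters accumulate since the last statistics reset, so judge with care, and keep indexes that enforce PK / UNIQUE constraints even if unscanned):

SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0   -- never used since the last stats reset
ORDER  BY schemaname, relname;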
The optimal index for your query is not there yet, though: a multicolumn index on (scat_group_code, revenue). Since this is a materialized view, it's reasonable to assume a read-only situation, which is perfect for index-only scans.
Also, since the table (many columns, some potentially big text columns) is much wider than the index (two columns, fewer rows for the partial variant), an index-only scan is particularly efficient here. Size matters.
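The plain two-column variant would look like this:

CREATE INDEX ON v_premises_filter (scat_group_code, revenue);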
If this particular query is very common / important, I would go one step further and make it a partial, multicolumn index:
CREATE INDEX ON v_premises_filter (scat_group_code, revenue)
WHERE area IS NOT NULL
AND revenue IS NOT NULL;
I added the 2nd condition revenue IS NOT NULL because min() and max() ignore NULL values. I tested with Postgres 9.6. You may have to add the logically redundant predicate AND revenue IS NOT NULL to your query in older versions to make the query planner understand the index is applicable.
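The query itself isn't shown here; a sketch of the shape it would take, with a made-up constant:

SELECT min(revenue), max(revenue)
FROM   v_premises_filter
WHERE  scat_group_code = 'FOO'  -- made-up value
AND    area IS NOT NULL
AND    revenue IS NOT NULL;     -- logically redundant, but helps older planners match the partial index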
The index your query used in your tests (v_premises_filter_revenue_idx) is on (revenue), which is pretty inefficient for a predicate on scat_group_code. Adding that column as the first index item makes a big difference, because Postgres can now cheaply grab only the share with a matching scat_group_code. Where your 2nd query had to filter 1631527 rows (Rows Removed by Filter: 1631527), it only has to filter far fewer rows with area IS NULL OR revenue IS NULL for the simple index, and none at all for the partial index.
Also, since irrelevant rows are excluded from the index, it is substantially smaller, which contributes to performance as well.
You should see two index-only scans and a much faster query in either case now.
Why the fluctuation?
Your table is big (1909175 rows) and data distribution is obviously irregular. The default statistics target of 100 (default_statistics_target) is probably too small to gather enough details. Postgres decides to use a bad query plan for your 2nd query, based on insufficient information. It might help to increase the STATISTICS setting for scat_group_code, revenue and area.
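A sketch of how to do that (1000 is an illustrative value; ANALYZE afterwards collects the new statistics):

ALTER MATERIALIZED VIEW v_premises_filter ALTER COLUMN scat_group_code SET STATISTICS 1000;
ALTER MATERIALIZED VIEW v_premises_filter ALTER COLUMN revenue SET STATISTICS 1000;
ALTER MATERIALIZED VIEW v_premises_filter ALTER COLUMN area SET STATISTICS 1000;
ANALYZE v_premises_filter;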
But you won't need this any more with the suggested new index, because an index-only scan on it is always the best plan.
Best Answer
You can use NOT LIKE to include all the non-taxi-related data -- check that the phase doesn't begin with TAXI_. Then, for each size, add a comparison that couples the airport size to the taxi route with OR.
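A minimal sketch of that shape; the table and column names (flights, phase, airport_size) and the size/route pairings are made up, since the original schema isn't shown:

SELECT *
FROM   flights
WHERE  phase NOT LIKE 'TAXI!_%' ESCAPE '!'  -- all non-taxi rows; '_' is a LIKE wildcard, so escape it
OR    (airport_size = 'LARGE'  AND phase = 'TAXI_LONG')
OR    (airport_size = 'MEDIUM' AND phase = 'TAXI_MEDIUM')
OR    (airport_size = 'SMALL'  AND phase = 'TAXI_SHORT');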