PostgreSQL – Create Gist Index on Cube Array with 512 Dimensions

gist-indexpostgresql

I want to perform nearest neighbors search on db with 5000 entries of cube array with 512 dimensions.

I referred to this answer and used docker image of postgresql with cube array limit set to 2048 which can be found in the comment to the answer. But I'm getting the following error:

ERROR: failed to add item to index page in "faq_c_idx" SQL state: XX000

It works with dimensions=~200, but when I set higher dimensions I keep receiving the error above.

Here's the query:

CREATE TABLE mytable
AS
  SELECT i, cube(array_agg(random()::float)) AS c
  FROM generate_series(1,5000) AS i
  CROSS JOIN LATERAL generate_series(1,512)
    AS x
  GROUP BY i;

CREATE INDEX ON mytable USING gist ( c );
ANALYZE mytable;

Best Answer

It can't find a place to store the index tuple, probably because the tuple is larger than a page and so can't fit on any page. You could recompile with a larger blocksize, e.g. ./configure --with-blocksize=32, but then you will just run into a different limitation, probably INDEX_SIZE_MASK, which is not so easy to work around.

So, you just can't do this. But if you could, an index into a 512 dimensional cube is not likely to be all that useful anyway. It will probably be slower than a seq-scan would be, or at least not faster by very much.

Related Solutions

PostgreSQL: Issue with a corrupted GiST index on ll_to_earth

Looks like a bug in either the GiST index implementation, the ll_to_earth() function (or maybe the specific operator / operator class). You have pretty much ruled out plain index corruption by trying REINDEX.

First try to upgrade to a current version of Postgres and see if this fixes your problem. 9.5 is getting old (and the current point release is 9.5.15). I did not find any specific entries concerning your case in the release notes since 9.5.10 - on a quick glance! At least upgrade to latest point release, currently 9.5.15 - as is the recommendation of the project. Or to the latest version Postgres 11.1 if that's an option, and see if the problem goes away. (Or recreate your db in pg 11 just to test.)

Both extensions cube and earthdistance have been updated, too. Check curent versions for your installation with:

SELECT * FROM pg_available_extensions where name IN ('cube', 'earthdistance');

For Postgres 11:

cube    1.4  
earthdistance   1.1

To see current versions for any given Postgres version try this fiddle.

If upgrading does not fix the problem please file a bug report.

Postgresql – Containment check tstzrange @> timestamptz not using btree or gist index

The question is what is the reason the index is not used with the tstzrange containment operator and if there is a way to make it work.

The reason is quite trivial. B-tree indexes do no support the containment operator @>. Neither for range types like tstzrange nor for any other type (including array types).

The manual:

... a btree operator class must provide five comparison operators, <, <=, =, >= and >.

And GiST indexes do not support <, <=, > =, >= and >. See these chapters of the manual:

In Postgres indexes are bound to operators (which are implemented for certain types), not data types alone or functions or anything else. Related:

The GiST index event_seating_lookup_created_at_idx2 you have is pointless. It is created on the timestamptz column created_at. Such a GiST index would be useful for a range type (the opposite direction of logic).

Creating a GiST index on a timestamptz column is only possible since you installed the additional btree_gist extension to allow such useless indexes. (There are useful applications for multicolumn indexes or exclusion constraints ...) In stock Postgres you'd get an error:

ERROR: data type timestamp with time zone has no default operator class for access method "gist"

So while it's logically valid and technically possible to use a btree index (or a GiST index) for the query, the case is not implemented: no index support for timestamptz <@ tstzrange (where timestamptz would be the indexed expression!). It can be solved with <, <=, >, >= more efficiently. So I guess no developer felt (or will feel) the need to implement it.

Function to rewrite expression with favorable operators

Your implementation makes sense to me. It does what you want and due to function inlining makes use of a plain btree index on the timestamp column - event_seating_lookup_created_at_idx in your example. For calls with a constant range (like in a single function call) I suggest this modified version:

CREATE OR REPLACE FUNCTION in_range(timestamptz, tstzrange)
  RETURNS boolean AS
$func$
SELECT CASE WHEN lower_inc($2) THEN $1 >= lower($2) ELSE $1 > lower($2) END
   AND CASE WHEN upper_inc($2) THEN $1 <= upper($2) ELSE $1 < upper($2) END
$func$  LANGUAGE sql IMMUTABLE;

Declare it IMMUTABLE, because it actually is. Not to help function inlining (can even prevent it if the declaration is false) but for other gains. Related:

It can be inlined and uses the index just like your version. The difference: this one suppresses a redundant index condition for exclusive bounds. Fortunately, your consideration in this regard is slightly off target:

The reason for first matching on the $1 >= lower($2) AND $1 <= upper($2) condition and then checking the upper_inc and lower_inc constraints is to benefit from range scan first and then filter the result.

Postgres 11 (at least) is even smarter than that. I see no Filter step for your version. For default [) bounds (like in your example), I get this query plan (line breaks in conditions added by me):

  ->  Index Only Scan using foo_idx on foo (actual rows=5206 loops=1)
        Index Cond: ((datetime >= '2018-09-05 22:00:00+00'::timestamp with time zone)
                 AND (datetime <= '2018-09-05 22:30:00+00'::timestamp with time zone)
                 AND (datetime < '2018-09-05 22:30:00+00'::timestamp with time zone))

An actual Filter step could add a more substantial cost for corner cases with many hits on an excluded bound. Those would be fetched from the index and then discarded. Quite relevant for time ranges where values often end up on bounds - like timestamps on the full hour.

The actual difference is see is minor, but why not take it?

  ->  Index Only Scan using foo_idx on foo (actual rows=5206 loops=1)
        Index Cond: ((datetime >= '2018-09-05 22:00:00+00'::timestamp with time zone)
                 AND (datetime < '2018-09-05 22:30:00+00'::timestamp with time zone))

Best Answer

Related Solutions

PostgreSQL: Issue with a corrupted GiST index on ll_to_earth

Postgresql – Containment check tstzrange @> timestamptz not using btree or gist index

Function to rewrite expression with favorable operators

Related Question