Postgresql – Compound Index on 43 Million PostgreSQL table

indexindex-tuningpostgresql

This question is related to a prior one of I asked: Order of columns in a compound index in PostgreSQL (and query order)

Rather than overload that question, I figure I can sharpen and limit my question here. Given the following query (and EXPLAIN ANALYZE), is the compound index I am creating helping?

This first query was run with only simple indexes (a GIST on outline) and a (BTREE on pid).

The query is:

EXPLAIN ANALYZE SELECT DISTINCT ON (path) oid, pid, product_name, type, path, size 
FROM portal.inventory AS inv 
WHERE ST_Intersects(st_geogfromtext('SRID=4326;POLYGON((21.51947021484375 51.55059814453125, 18.9129638671875 51.55059814453125, 18.9129638671875 48.8287353515625, 21.51947021484375 48.8287353515625, 21.51947021484375 51.55059814453125))'), inv.outline) 
AND (inv.pid in (20010,20046)) 

The result was the following (which is faster, but perhaps that is only because the database was warm).

"Unique  (cost=581.76..581.76 rows=1 width=89) (actual time=110.436..110.655 rows=249 loops=1)"
"  ->  Sort  (cost=581.76..581.76 rows=1 width=89) (actual time=110.434..110.477 rows=1377 loops=1)"
"        Sort Key: path"
"        Sort Method: quicksort  Memory: 242kB"
"        ->  Bitmap Heap Scan on inventory inv  (cost=577.48..581.75 rows=1 width=89) (actual time=39.257..105.878 rows=1377 loops=1)"
"              Recheck Cond: ((pid = ANY ('{20010,20046}'::integer[])) AND ('0103000020E6100000010000000500000000000000FC843540000000007AC6494000000000B8E93240000000007AC6494000000000B8E9324000000000146A484000000000FC84354000000000146A484000000000FC843540000000007AC64940'::geography && outline))"
"              Rows Removed by Index Recheck: 3731"
"              Filter: (_st_distance('0103000020E6100000010000000500000000000000FC843540000000007AC6494000000000B8E93240000000007AC6494000000000B8E9324000000000146A484000000000FC84354000000000146A484000000000FC843540000000007AC64940'::geography, outline, 0::double precision, false) < 1e-005::double precision)"
"              Rows Removed by Filter: 533"
"              ->  BitmapAnd  (cost=577.48..577.48 rows=1 width=0) (actual time=38.972..38.972 rows=0 loops=1)"
"                    ->  Bitmap Index Scan on inventory_pid_idx  (cost=0.00..123.82 rows=6204 width=0) (actual time=1.116..1.116 rows=7836 loops=1)"
"                          Index Cond: (pid = ANY ('{20010,20046}'::integer[]))"
"                    ->  Bitmap Index Scan on inventory_outline_idx  (cost=0.00..453.41 rows=8212 width=0) (actual time=37.765..37.765 rows=63112 loops=1)"
"                          Index Cond: ('0103000020E6100000010000000500000000000000FC843540000000007AC6494000000000B8E93240000000007AC6494000000000B8E9324000000000146A484000000000FC84354000000000146A484000000000FC843540000000007AC64940'::geography && outline)"
"Total runtime: 110.731 ms"

Now here is the result with the compound index added: (note that the absolute time was slower)

"Unique  (cost=37.81..37.82 rows=1 width=89) (actual time=2464.353..2464.561 rows=249 loops=1)"
"  ->  Sort  (cost=37.81..37.82 rows=1 width=89) (actual time=2464.349..2464.389 rows=1377 loops=1)"
"        Sort Key: path"
"        Sort Method: quicksort  Memory: 242kB"
"        ->  Bitmap Heap Scan on inventory inv  (cost=33.54..37.80 rows=1 width=89) (actual time=2361.018..2459.653 rows=1377 loops=1)"
"              Recheck Cond: (('0103000020E6100000010000000500000000000000FC843540000000007AC6494000000000B8E93240000000007AC6494000000000B8E9324000000000146A484000000000FC84354000000000146A484000000000FC843540000000007AC64940'::geography && outline) AND (pid = ANY ('{20010,20046}'::integer[])))"
"              Filter: (_st_distance('0103000020E6100000010000000500000000000000FC843540000000007AC6494000000000B8E93240000000007AC6494000000000B8E9324000000000146A484000000000FC84354000000000146A484000000000FC843540000000007AC64940'::geography, outline, 0::double precision, false) < 1e-005::double precision)"
"              Rows Removed by Filter: 533"
"              ->  Bitmap Index Scan on inventory_compound_idx  (cost=0.00..33.53 rows=1 width=0) (actual time=2321.684..2321.684 rows=1910 loops=1)"
"                    Index Cond: (('0103000020E6100000010000000500000000000000FC843540000000007AC6494000000000B8E93240000000007AC6494000000000B8E9324000000000146A484000000000FC84354000000000146A484000000000FC843540000000007AC64940'::geography && outline) AND (pid = ANY ('{20010,20046}'::integer[])))"
"Total runtime: 2558.022 ms"

Finally, here is the table definition:

CREATE TABLE portal.inventory
(
  oid bigint,
  product_name character varying(100),
  type character varying(25),
  pid integer,
  size bigint,
  date timestamp without time zone,
  path character varying(200),
  outline geography(Polygon,4326)
)
WITH (
  OIDS=FALSE
);


CREATE INDEX inventory_compound_idx
  ON portal.inventory
  USING gist
  (outline, pid);


CREATE INDEX inventory_outline_idx
  ON portal.inventory
  USING gist
  (outline);


CREATE INDEX inventory_pid_idx
  ON portal.inventory
  USING btree
  (pid);

UPDATE: Answers to questions listed below:

I can adapt the table, but I am trying to keep the rows thin. Your suggestions are varying, types, etc. are things I would like to change.

Basically, each row represents a little bit of meta data about a geospatial image file. We are managing 50M, and this may well grow to hundreds of millions or more. In the DB each file is referenced by a unique OID (sorry for the duplication of the term). They are grouped by "products" where PID is the product ID. There can be about 1,000 OIDs per product. Each image file has a geospatial bounding box (the outline). That is really all that I need for searching. The rest of the data will not be null (type is a text string, size the the file size, date is the date the file was created, and path is a UNC file path to the file).

Now here is why I orderd the query by outline, then PID. Products will be geospatially grouped. So all OID rows for Krakow Poland will be located in the same region physically. So I assume that if I get the bucket down to a small region, the second index will be pretty small (say about 100 products for a city region). Which the IN( ..) clause will pull out.

The actually values of the PIDS were pulled from the other question I posted here. But that table only for products, and thus its size is about 30K, which means quick searches and no need for compound queries.

I wonder if the POSTGreSQL planner is smart enough to decide if a compound index by (outline,pid) is faster than (pid, outline) if both indexes are there. Well I guess I can test.

Best Answer

In a GiST index, the order of columns has a different significance than in a B-tree index. Per documentation:

A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.

In short: put the most selective columns first.

Your EXPLAIN output shows that the condition on pid is more selective (rows=7836) than the one on outline (rows=63112). If that can be generalized (a single example may be misleading) I suggest this alternative:

CREATE INDEX inventory_compound_idx ON portal.inventory USING gist (pid, outline);

If most of your (important) queries include conditions on both columns, a multicolumn index may serve you well. Else, single columns may be better overall.

Table layout

This is an educated guess since I don't have complete information.

  • Don't use oid as column name. It's easy to confuse with the OID.

  • Don't use the name date for a timestamp column. Or rather: don't use the name date for any column, don't use names of base-types for identifiers at all. Can lead to confusing mistakes and error messages.

  • Create a lookup table for types and only put a small integer type_id into the big table. Pack fixed-length types tightly so not to waste space to padding. Details.

  • I prefer the type text (or varchar without length limit) over varchar(n). Details.

For example:

CREATE TABLE portal.inventory (
   inventory_id bigint PRIMARY KEY
  ,type_id      integer NOT NULL REFERENCES inv_type(type_id)
  ,pid          integer NOT NULL
  ,size         bigint NOT NULL
  ,ts           timestamp NOT NULL
  ,outline      geography(Polygon,4326)
  ,product_name text
  ,path         text
);