PostgreSQL – Composite multicolumn index for geopoint range and numeric range query

Tags: gin-index, gist-index, index, postgresql

I am building an app where the server needs to select rows based on several filters. One of them is the user's location and the radius within which the user wants to see posts; the others are a date range and a filter on the value of another column. This is for an ad-hoc event discovery app.

I have read about PostGIS and its geometry and geography types, and I know there is a native point data type. Based on this answer, I understand that it is better to order index columns from equality to range conditions, even though I feel like the geo point column should come first.

Suppose the following few rows of a simplified events table (disregard the validity of the position data):

id  event_title                  event_position   event_type  is_public  start_date
    (varchar)                    (point lat/lon)  (smallint)  (boolean)  (timestamptz)
--  ---------------------------  ---------------  ----------  ---------  -------------
 1  "John's Party"               (122,35)         0           0          2020-07-05
 2  "Revolution then Starbucks"  (123,30)         1           1          2020-07-06
 3  "Study for math exam"        (120,36)         2           1          2020-07-07
 4  "Party after exam"           (120,36)         1           1          2020-07-08
 5  "Hiking next to the city"    (95,40)          3           1          2020-07-09
 6  "Football match"             (-42,31)         4           1          2020-07-10

Imagine the table contains several thousand records at least, obviously not only 6.

So with this table, a user would be able to query public events within 100 km of (122,34) (suppose the first three rows fall into this area), of event types 0, 1 or 2, and with start dates between 2020-07-05 and 2020-07-07. The user would get the rows with IDs 2 and 3.

This is the query I want to optimize with an appropriate index. My question is: how can I create such an index? I thought about a GiST or GIN index, but I am not sure how these could help. Thanks!

Best Answer

@jjanes' answer on Stack Overflow, which I accepted:

With the help of the btree_gist extension, you can include the event_type and start_date columns in the GiST index along with event_position. However, event_type won't be very useful in the index as long as the restriction clause is something like event_type in (0, 1, 2). (But if the list contains only one element, it will be rewritten into an equality, and in that case it can use that column in the index efficiently.) So using the other two columns, start_date and event_position, would be my starting point. I would put the one that is usually more selective first. If you are going to leave past events in the table rather than clearing them out, then the date is probably going to be the more selective one eventually.
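
A minimal sketch of that starting point, assuming the native point type from the question and a table named events. The table definition, the index name and the radius are illustrative, not taken from the answer. The circle radius is expressed in the same units as the coordinates, so a real 100 km search would need PostGIS geography or a unit conversion.

```sql
-- btree_gist supplies GiST operator classes for scalar types such as timestamptz.
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Assumed table shape, following the sample rows in the question.
CREATE TABLE IF NOT EXISTS events (
    id             serial      PRIMARY KEY,
    event_title    varchar     NOT NULL,
    event_position point       NOT NULL,
    event_type     smallint    NOT NULL,
    is_public      boolean     NOT NULL,
    start_date     timestamptz NOT NULL
);

-- Date first, on the assumption that it ends up being the more selective column.
CREATE INDEX IF NOT EXISTS events_date_position_gist_idx
    ON events USING gist (start_date, event_position);

SELECT id, event_title
FROM   events
WHERE  is_public
AND    event_type IN (0, 1, 2)                   -- not in this index, applied as a filter
AND    start_date >= '2020-07-05'
AND    start_date <  '2020-07-08'                -- half-open range covering all of 2020-07-07
AND    event_position <@ circle '<(122,34),1>';  -- placeholder radius, in coordinate units
```

Running the query under EXPLAIN (ANALYZE, BUFFERS) against realistic data would show whether the planner actually picks this index and how much work each condition saves.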

If the location condition is not very selective at all (e.g. most of your events are in NYC, and most of your users are in NYC, and almost everything is within 100km of almost everything else), then you might want a different approach. Just make a default BTREE index on (event_type, start_date). Unlike GiST, such a BTREE can effectively use a condition like event_type in (0, 1, 2) along with AND start_date between x and y.
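
As a rough sketch of that alternative, assuming the events table from the question already exists (the index name is made up for illustration):

```sql
-- Plain B-tree on the two non-spatial columns; assumes the events table from the question.
CREATE INDEX IF NOT EXISTS events_type_date_idx
    ON events (event_type, start_date);

SELECT id, event_title
FROM   events
WHERE  event_type IN (0, 1, 2)                             -- in-list usable by the B-tree
AND    start_date BETWEEN '2020-07-05' AND '2020-07-07'    -- range on the second column
AND    is_public                                           -- remaining conditions are filters
AND    event_position <@ circle '<(122,34),1>';            -- placeholder radius, in coordinate units
```

Here the location test is only a filter on the rows the B-tree returns, which is acceptable precisely in the case described above, where the location condition removes few rows anyway.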

I don't think that there is a fundamental reason GiST can't use the in-list effectively while BTREE can. Maybe that will be fixed for GiST in some future version.

Related Solutions

PostgreSQL Index Tuning – Multicolumn Index and Performance

Answer

Since you refer to the website use-the-index-luke.com, consider the chapter:

Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Rule of thumb: index for equality first — then for ranges.
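
A small sketch of the rule, using a hypothetical readings table (all names here are invented for illustration and do not come from the original question):

```sql
-- Hypothetical table, used only to illustrate the column order.
CREATE TABLE IF NOT EXISTS readings (
    sensor_id   int         NOT NULL,
    recorded_at timestamptz NOT NULL,
    reading     numeric
);

-- Equality column first, range column second.
CREATE INDEX IF NOT EXISTS readings_sensor_time_idx
    ON readings (sensor_id, recorded_at);

-- The matching entries form one contiguous stretch of the index:
-- the sensor_id = 7 entries, which are ordered by recorded_at.
SELECT *
FROM   readings
WHERE  sensor_id = 7
AND    recorded_at >= '2020-07-05'
AND    recorded_at <  '2020-07-08';
```

With the columns reversed, the index would be ordered by recorded_at first, so the scan would have to walk every entry in the date range and discard those belonging to other sensors.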

Also good for just one column?

What to do for queries on just one column seems to be clear. More details and benchmarks on that are covered under a related question.

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

It doesn't matter which one comes first. Put the column first that is more likely to receive conditions of its own; that is what actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
     , (random() * 4)::int     AS few
FROM generate_series (1, 100000);

DELETE FROM t WHERE random() > 0.9;  -- create some dead tuples, more "real-life"

ANALYZE t;

SELECT count(distinct lots)   -- 9999
     , count(distinct few)    --    5
FROM   t;

Query:

SELECT *
FROM   t
WHERE  lots = 2345
AND    few = 2;

EXPLAIN ANALYZE output (Best of 10 to exclude caching effects):

Seq Scan on t  (cost=0.00..5840.84 rows=2 width=8)
               (actual time=5.646..15.535 rows=2 loops=1)
  Filter: ((lots = 2345) AND (few = 2))
  Buffers: local hit=443
Total runtime: 15.557 ms

Add index, retest:

CREATE INDEX t_lf_idx ON t(lots, few);

Index Scan using t_lf_idx on t  (cost=0.00..3.76 rows=2 width=8)
                                (actual time=0.008..0.011 rows=2 loops=1)
  Index Cond: ((lots = 2345) AND (few = 2))
  Buffers: local hit=4
Total runtime: 0.027 ms

Add other index, retest:

DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx  ON t(few, lots);

Index Scan using t_fl_idx on t  (cost=0.00..3.74 rows=2 width=8)
                                (actual time=0.007..0.011 rows=2 loops=1)
  Index Cond: ((few = 2) AND (lots = 2345))
  Buffers: local hit=4
Total runtime: 0.027 ms
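
The point about a column receiving conditions of its own could be checked by continuing the demo with a single-column predicate under each index in turn. This is only a sketch that assumes the table t and the index t_fl_idx from the steps above are still in place. No output is shown, because the resulting plans depend on the PostgreSQL version, statistics and settings.

```sql
-- With t_fl_idx (few, lots) in place, lots is not the leading column.
EXPLAIN ANALYZE
SELECT * FROM t WHERE lots = 2345;

-- Swap to the index that leads with lots and compare the two plans.
DROP INDEX t_fl_idx;
CREATE INDEX t_lf_idx ON t(lots, few);

EXPLAIN ANALYZE
SELECT * FROM t WHERE lots = 2345;
```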
