PostgreSQL – Can Spatial Index Help a Range-Order By-Limit Query

database-designindexperformancepostgresqlquery-performance

Asking this question, specifically for Postgres, as it has good supoort for R-tree/spatial indexes.

We have the following table with a tree structure (Nested Set model) of words and their frequencies:

lexikon
-------
_id   integer  PRIMARY KEY
word  text
frequency integer
lset  integer  UNIQUE KEY
rset  integer  UNIQUE KEY

And the query:

SELECT word
FROM lexikon
WHERE lset BETWEEN @Low AND @High
ORDER BY frequency DESC
LIMIT @N

I suppose a covering index on (lset, frequency, word) would be useful but I feel it may not perform well if there are too many lset values in the (@High, @Low) range.

A simple index on (frequency DESC) may also be sufficient sometimes, when a search using that index yields early the @N rows that match the range condition.

But it seems that performance depends a lot on the parameter values.

Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?

Would an R-tree/spatial index help?

Adding indexes, rewriting the query, re-designing the table, there is no limitation.

Best Answer

You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:

--testbed and lexikon dummy data:

begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial, 
                      word text, 
                      frequency integer, 
                      lset integer, 
                      width_granule integer);
--
insert into lexikon(word, frequency, lset) 
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'

granule analysis (mostly for information and tuning):

create table granule as 
select width_granule, count(*) as freq, 
       min(frequency) as granule_start, max(frequency) as granule_end 
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
 width_granule |  freq  | granule_start | granule_end
---------------+--------+---------------+-------------
             0 | 500000 |             1 |           1
             1 | 300000 |             2 |           4
             2 | 123077 |             5 |          12
             3 |  47512 |            13 |          33
             4 |  18422 |            34 |          90
             5 |   6908 |            91 |         244
             6 |   2580 |           245 |         665
             7 |    949 |           666 |        1808
             8 |    349 |          1811 |        4901
             9 |    129 |          4926 |       13333
            10 |     47 |         13513 |       35714
            11 |     17 |         37037 |       90909
            12 |      7 |        100000 |      250000
            13 |      2 |        333333 |      500000
            14 |      1 |       1000000 |     1000000
*/
alter table granule drop column freq;
--

function for scanning high frequencies first:

create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
       returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
  m integer;
  n integer := 0;
  r record;
begin 
  for r in (select width_granule from granule order by width_granule desc) loop
    return query( select * 
                  from lexikon 
                  where width_granule=r.width_granule 
                        and lset>=p_lset_low and lset<=p_lset_high );
    get diagnostics m = row_count;
    n = n+m;
    exit when n>=p_limit;
  end loop;
end;$$;

results (timings should probably be taken with a pinch of salt but each query is run twice to counter any caching)

first using the function we've written:

\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 0.510 ms
*/

and then with a simple index scan:

select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 51.250 ms
*/
\timing off
--
rollback;

Depending on your real-world data, you will probably want to vary the number of granules and the function used for putting rows into them. The actual distribution of frequencies is key here, as is the expected values for the limit clause and size of lset ranges sought.