You may be able to achieve better performance by searching first in the rows with higher frequencies. This can be done by 'granulating' the frequencies and then stepping through the granules procedurally, for example as follows:
testbed and lexikon dummy data:
begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial,
                      word text,
                      frequency integer,
                      lset integer,
                      width_granule integer);
--
insert into lexikon(word, frequency, lset)
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'
granule analysis (mostly for information and tuning):
create table granule as
select width_granule, count(*) as freq,
min(frequency) as granule_start, max(frequency) as granule_end
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
width_granule | freq | granule_start | granule_end
---------------+--------+---------------+-------------
0 | 500000 | 1 | 1
1 | 300000 | 2 | 4
2 | 123077 | 5 | 12
3 | 47512 | 13 | 33
4 | 18422 | 34 | 90
5 | 6908 | 91 | 244
6 | 2580 | 245 | 665
7 | 949 | 666 | 1808
8 | 349 | 1811 | 4901
9 | 129 | 4926 | 13333
10 | 47 | 13513 | 35714
11 | 17 | 37037 | 90909
12 | 7 | 100000 | 250000
13 | 2 | 333333 | 500000
14 | 1 | 1000000 | 1000000
*/
alter table granule drop column freq;
--
function for scanning high frequencies first:
create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
  m integer;
  n integer := 0;
  r record;
begin
  -- step through the granules from highest frequency to lowest, returning
  -- every matching row in each granule until at least p_limit rows have
  -- been returned in total
  for r in (select width_granule from granule order by width_granule desc) loop
    return query( select *
                  from lexikon
                  where width_granule=r.width_granule
                  and lset>=p_lset_low and lset<=p_lset_high );
    get diagnostics m = row_count;
    n = n+m;
    exit when n>=p_limit;
  end loop;
end;$$;
results (timings should probably be taken with a pinch of salt, but each query is run twice to counter the effects of caching)
first using the function we've written:
\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 0.510 ms
*/
and then with a simple index scan:
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 51.250 ms
*/
\timing off
--
rollback;
Depending on your real-world data, you will probably want to vary the number of granules and the function used for assigning rows to them. The actual distribution of frequencies is key here, as are the expected values for the limit clause and the size of the lset ranges sought.
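For example, if the frequencies are less heavily skewed you might use a coarser granulation. The sketch below changes the log base from e to 4, which roughly halves the number of granules; the base of 4 is an illustrative assumption, not a value derived from the question, and the granule table would need to be rebuilt afterwards in the same way as above.
-- hypothetical coarser granulation: fewer granules means fewer index probes
-- per call, at the cost of scanning more rows within each granule
update lexikon set width_granule = (ln(frequency)/ln(4))::integer;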
The diversity of a column's data is known as selectivity. Selectivity is useful to know when determining whether an index will be useful, but it is not the only thing that determines the speed benefit. Other factors include how fast the storage holding the index is compared to the storage holding the table, how much of the table and index is already cached, how large the index is compared to the table, and several other things.
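If you want a rough measure of selectivity in PostgreSQL, the planner's own statistics are a convenient place to look. This is only a sketch, reusing the lexikon table defined earlier purely for illustration, and it assumes ANALYZE has run so that pg_stats is populated:
-- n_distinct > 0 is an estimated number of distinct values;
-- n_distinct < 0 is minus the estimated fraction of rows that are distinct
select attname, n_distinct, null_frac
from pg_stats
where schemaname = 'stack' and tablename = 'lexikon';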
Knowing the data type of the column does not necessarily help us determine how selective an index on the column will be. Even a column constrained to two values might use those values for only a few rows and have NULL in the remainder. On the other hand, a column that could hold many distinct values might have the same value in every single row. Even with your id column, where every row has a unique value, if you are searching for rows with id >= 10 the index probably wouldn't be useful even though it is highly selective.
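You can watch the planner make that choice with EXPLAIN. The following is a sketch with a hypothetical table t; the exact plan depends on your data, but when a predicate such as id >= 10 matches nearly every row the planner will usually pick a sequential scan despite the index:
-- hypothetical example table with an index supplied by the primary key
create table t (id serial primary key, payload text);
insert into t (payload) select 'x' from generate_series(1, 100000);
analyze t;
explain select * from t where id >= 10;  -- typically a Seq Scan, not an Index Scan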
You can't use selectivity alone to determine whether an index will be useful, because even if a query returns 100% of the rows, an index that includes all the data the query needs can still be faster than reading the table. On the other hand, for a small table it may be faster to scan the whole thing even if the row being sought represents only 1% of the total.
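As an illustration of the first point, an index that contains every column a query needs can be answered with an index-only scan even when most of the table qualifies. This is a sketch only, reusing the lexikon table from above and assuming PostgreSQL 11 or later for the INCLUDE syntax:
-- lset is the search key; frequency and word are carried in the index so the
-- heap need not be visited (given an up-to-date visibility map)
create index on lexikon (lset) include (frequency, word);
select word, frequency from lexikon where lset between 20000 and 30000;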
Determining what indexes should be created is less about looking at the table structure than it is about looking at the important queries and the data they need to retrieve.
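In practice that means running EXPLAIN on the queries that matter and seeing what they actually do. A sketch, using the query form shown earlier:
explain (analyze, buffers)
select word, frequency
from lexikon
where lset between 20000 and 30000
order by frequency desc
limit 5;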
Best Answer
I would say give that covering index a try: (lset, frequency, word). But you did not give much information. Please post how many rows your table has, how large it is in bytes, the maximum number of rows you expect your query to return, and the cardinality of your data.
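A sketch of that suggestion, assuming the table and column names used earlier in the thread; the index name is illustrative:
-- covering index on (lset, frequency, word): an index-only scan can avoid
-- heap fetches, although the sort on frequency for a range of lset remains
create index lexikon_lset_frequency_word_idx on lexikon (lset, frequency, word);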