You may be able to achieve better performance by searching first in the rows with higher frequencies. This can be done by 'granulating' the frequencies and then stepping through the granules procedurally, for example as follows:
testbed and lexikon dummy data:
begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial,
                      word text,
                      frequency integer,
                      lset integer,
                      width_granule integer);
--
insert into lexikon(word, frequency, lset)
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'
granule analysis (mostly for information and tuning):
create table granule as
select width_granule, count(*) as freq,
min(frequency) as granule_start, max(frequency) as granule_end
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
width_granule | freq | granule_start | granule_end
---------------+--------+---------------+-------------
0 | 500000 | 1 | 1
1 | 300000 | 2 | 4
2 | 123077 | 5 | 12
3 | 47512 | 13 | 33
4 | 18422 | 34 | 90
5 | 6908 | 91 | 244
6 | 2580 | 245 | 665
7 | 949 | 666 | 1808
8 | 349 | 1811 | 4901
9 | 129 | 4926 | 13333
10 | 47 | 13513 | 35714
11 | 17 | 37037 | 90909
12 | 7 | 100000 | 250000
13 | 2 | 333333 | 500000
14 | 1 | 1000000 | 1000000
*/
alter table granule drop column freq;
--
function for scanning high frequencies first:
create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
  m integer;
  n integer := 0;
  r record;
begin
  -- step through the granules from highest frequency to lowest, returning
  -- every matching row in each granule until at least p_limit rows have
  -- been returned in total
  for r in (select width_granule from granule order by width_granule desc) loop
    return query( select *
                  from lexikon
                  where width_granule=r.width_granule
                  and lset>=p_lset_low and lset<=p_lset_high );
    get diagnostics m = row_count;
    n = n+m;
    exit when n>=p_limit;
  end loop;
end;$$;
results (timings should probably be taken with a pinch of salt, but each query is run twice to counter the effects of caching)
first using the function we've written:
\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 0.510 ms
*/
and then with a simple index scan:
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 51.250 ms
*/
\timing off
--
rollback;
Depending on your real-world data, you will probably want to vary the number of granules and the function used for assigning rows to them. The actual distribution of frequencies is key here, as are the expected values for the limit clause and the size of the lset ranges sought.
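For example, if the frequencies are less heavily skewed you might use a coarser granulation. The sketch below changes the log base from e to 4, which roughly halves the number of granules; the base of 4 is an illustrative assumption, not a value derived from the question, and the granule table would need to be rebuilt afterwards in the same way as above.
-- hypothetical coarser granulation: fewer granules means fewer index probes
-- per call, at the cost of scanning more rows within each granule
update lexikon set width_granule = (ln(frequency)/ln(4))::integer;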
The diversity of a column's data is known as selectivity. Selectivity is useful to know when determining whether an index will be useful, but it is not the only thing that determines the speed benefit. Other factors include how fast the storage holding the index is compared to the storage holding the table, how much of the table and index is already cached, how large the index is compared to the table, and several other things.
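If you want a rough measure of selectivity in PostgreSQL, the planner's own statistics are a convenient place to look. This is only a sketch, reusing the lexikon table defined earlier purely for illustration, and it assumes ANALYZE has run so that pg_stats is populated:
-- n_distinct > 0 is an estimated number of distinct values;
-- n_distinct < 0 is minus the estimated fraction of rows that are distinct
select attname, n_distinct, null_frac
from pg_stats
where schemaname = 'stack' and tablename = 'lexikon';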
Knowing the data type of the column does not necessarily help us determine how selective an index on the column will be. Even a column constrained to two values might use those values for only a few rows and have NULL in the remainder. On the other hand, a column that could hold many distinct values might have the same value in every single row. Even with your id column, where every row has a unique value, if you are searching for rows with id >= 10 the index probably wouldn't be useful even though it is highly selective.
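You can watch the planner make that choice with EXPLAIN. The following is a sketch with a hypothetical table t; the exact plan depends on your data, but when a predicate such as id >= 10 matches nearly every row the planner will usually pick a sequential scan despite the index:
-- hypothetical example table with an index supplied by the primary key
create table t (id serial primary key, payload text);
insert into t (payload) select 'x' from generate_series(1, 100000);
analyze t;
explain select * from t where id >= 10;  -- typically a Seq Scan, not an Index Scan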
You can't use selectivity alone to determine whether an index will be useful, because even if a query returns 100% of the rows, an index that includes all the data the query needs can still be faster than reading the table. On the other hand, for a small table it may be faster to scan the whole thing even if the row being sought represents only 1% of the total.
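As an illustration of the first point, an index that contains every column a query needs can be answered with an index-only scan even when most of the table qualifies. This is a sketch only, reusing the lexikon table from above and assuming PostgreSQL 11 or later for the INCLUDE syntax:
-- lset is the search key; frequency and word are carried in the index so the
-- heap need not be visited (given an up-to-date visibility map)
create index on lexikon (lset) include (frequency, word);
select word, frequency from lexikon where lset between 20000 and 30000;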
Determining what indexes should be created is less about looking at the table structure than it is about looking at the important queries and the data they need to retrieve.
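In practice that means running EXPLAIN on the queries that matter and seeing what they actually do. A sketch, using the query form shown earlier:
explain (analyze, buffers)
select word, frequency
from lexikon
where lset between 20000 and 30000
order by frequency desc
limit 5;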
Best Answer
I would say give that covering index a try: (lset, frequency, word). But you did not give much information. Please post how many rows your table has, how large it is in bytes, the maximum number of rows you expect your query to return, and the cardinality of your data.
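A sketch of that suggestion, assuming the table and column names used earlier in the thread; the index name is illustrative:
-- covering index on (lset, frequency, word): an index-only scan can avoid
-- heap fetches, although the sort on frequency for a range of lset remains
create index lexikon_lset_frequency_word_idx on lexikon (lset, frequency, word);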