How much does an index need to narrow the results of a search in order to be useful

index

How much does an index need to narrow the results of a search in order to be useful in speeding up queries?

Some examples all across the spectrum:

A column for storing true/false values obviously has only two unique values.
A 'last name' column probably has many unique values (although it may not).
A primary key column has all unique values.

I think that the goal of an index is to quickly narrow a search to a few rows, and that therefore, the last case is the best, the second is OK, and the first is useless.

Am I correct? If so, roughly where is the line of usefulness? For example, if an index can narrow the results to 1% of the rows, is that useful? What about 10% or 25%?

Best Answer

The diversity of a column's data is known as selectivity. Selectivity is useful to know when determining whether an index will be useful, but it is not the only thing that determines the speed benefit. Other factors include how fast the storage the index is on compared to the table, how much of the table/index is already cached, how large the index is in comparison to the table, and several other things.

Knowing the data-type of the column does not necessarily help us determine how selective an index on the column will be. Even a column constrained to two values might use those values for only a few rows and have the remainder NULL. On the other hand, a column that could have many distinct values could have the same value in every single row. Even with your id column where all the rows would have unique values, if you are searching for rows with an id >= 10, the index probably wouldn't be useful even though it is highly selective.

You can't use selectivity alone to determine whether an index will be useful or not because even if it returns 100% of the rows, if the index includes all the data necessary for the query it will be faster than using the table. On the other hand, for a small table it may be faster to query the whole thing even if the row being sought only represents 1% of the total.

Determining what indexes should be created is less about looking at the table structure than it is looking at the important queries and what data they need to retrieve.

Related Solutions

Performance of query with a range condition and order by

I would say give that covering index a try (lset, frequency, word), but you did not give much information. Please post how many rows does your table have, how large in bytes it is, how many maximum rows are you expecting to get back from your query, what's the cardinality of your data?

PostgreSQL – Can Spatial Index Help a Range-Order By-Limit Query

You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:

--testbed and lexikon dummy data:

begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial, 
                      word text, 
                      frequency integer, 
                      lset integer, 
                      width_granule integer);
--
insert into lexikon(word, frequency, lset) 
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'

granule analysis (mostly for information and tuning):

create table granule as 
select width_granule, count(*) as freq, 
       min(frequency) as granule_start, max(frequency) as granule_end 
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
 width_granule |  freq  | granule_start | granule_end
---------------+--------+---------------+-------------
             0 | 500000 |             1 |           1
             1 | 300000 |             2 |           4
             2 | 123077 |             5 |          12
             3 |  47512 |            13 |          33
             4 |  18422 |            34 |          90
             5 |   6908 |            91 |         244
             6 |   2580 |           245 |         665
             7 |    949 |           666 |        1808
             8 |    349 |          1811 |        4901
             9 |    129 |          4926 |       13333
            10 |     47 |         13513 |       35714
            11 |     17 |         37037 |       90909
            12 |      7 |        100000 |      250000
            13 |      2 |        333333 |      500000
            14 |      1 |       1000000 |     1000000
*/
alter table granule drop column freq;
--

function for scanning high frequencies first:

create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
       returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
  m integer;
  n integer := 0;
  r record;
begin 
  for r in (select width_granule from granule order by width_granule desc) loop
    return query( select * 
                  from lexikon 
                  where width_granule=r.width_granule 
                        and lset>=p_lset_low and lset<=p_lset_high );
    get diagnostics m = row_count;
    n = n+m;
    exit when n>=p_limit;
  end loop;
end;$$;

results (timings should probably be taken with a pinch of salt but each query is run twice to counter any caching)

first using the function we've written:

\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 0.510 ms
*/

and then with a simple index scan:

select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
 _id |   word    | frequency | lset  | width_granule
-----+-----------+-----------+-------+---------------
 141 | word23237 |      7092 | 23237 |             9
 246 | word25112 |      4065 | 25112 |             8
 275 | word23825 |      3636 | 23825 |             8
 409 | word28660 |      2444 | 28660 |             8
 418 | word29923 |      2392 | 29923 |             8
Time: 51.250 ms
*/
\timing off
--
rollback;

Depending on your real-world data, you will probably want to vary the number of granules and the function used for putting rows into them. The actual distribution of frequencies is key here, as is the expected values for the limit clause and size of lset ranges sought.

Best Answer

Related Solutions

Performance of query with a range condition and order by

PostgreSQL – Can Spatial Index Help a Range-Order By-Limit Query

Related Question