Asking this question specifically for Postgres, as it has good support for R-tree/spatial indexes.
We have the following table with a tree structure (Nested Set model) of words and their frequencies:
lexikon
-------
_id integer PRIMARY KEY
word text
frequency integer
lset integer UNIQUE KEY
rset integer UNIQUE KEY
And the query:
SELECT word
FROM lexikon
WHERE lset BETWEEN @Low AND @High
ORDER BY frequency DESC
LIMIT @N
I suppose a covering index on (lset, frequency, word) would be useful, but I feel it may not perform well if there are too many lset values in the (@Low, @High) range.
A simple index on (frequency DESC) may also be sufficient sometimes, namely when a scan of that index yields the @N rows matching the range condition early.
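For concreteness, the two candidate indexes could be declared as follows (the index names are illustrative, not prescribed by the question):

```sql
-- Covering index for the range + order-by + limit query
-- (on PostgreSQL 11+ the word column could go in an INCLUDE clause instead):
CREATE INDEX lexikon_lset_freq_word ON lexikon (lset, frequency DESC, word);

-- Simple index for scanning by descending frequency alone:
CREATE INDEX lexikon_freq ON lexikon (frequency DESC);
```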
But it seems that performance depends a lot on the parameter values.
Is there a way to make it perform fast, regardless of whether the range (@Low, @High)
is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?
Would an R-tree/spatial index help?
There is no limitation: adding indexes, rewriting the query, or re-designing the table are all options.
Best Answer
You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:
In outline: a testbed with the lexikon table and dummy data; a granule analysis (mostly for information and tuning); and a function for scanning the high frequencies first.
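A sketch of how this might look. All names, the data distribution, and the granule definition (here simply the decimal order of magnitude of frequency) are illustrative assumptions, not the answer's original script:

```sql
-- Testbed: lexikon with one million rows of dummy data.
CREATE TABLE lexikon (
    _id       integer PRIMARY KEY,
    word      text,
    frequency integer,
    lset      integer UNIQUE,
    rset      integer UNIQUE,
    granule   integer   -- coarse frequency band, filled in below
);

INSERT INTO lexikon (_id, word, frequency, lset, rset)
SELECT g, 'word_' || g,
       trunc(exp(random() * 10))::integer + 1,   -- skewed frequencies
       g, 2000000 - g                            -- placeholder nested-set bounds
FROM generate_series(1, 1000000) AS g;

-- Granule = decimal order of magnitude of the frequency.
UPDATE lexikon SET granule = trunc(log(frequency))::integer;

CREATE INDEX lexikon_granule ON lexikon (granule DESC, lset, frequency DESC);
ANALYZE lexikon;

-- Granule analysis (mostly for information and tuning):
SELECT granule, count(*), min(frequency), max(frequency)
FROM lexikon
GROUP BY granule
ORDER BY granule DESC;

-- Function for scanning high frequencies first: step down through the
-- granules, collecting rows from the lset range until the limit is met.
-- Because each granule holds a disjoint, descending band of frequencies,
-- the concatenated result is still ordered by frequency DESC.
CREATE OR REPLACE FUNCTION lexikon_search(p_low integer, p_high integer,
                                          p_limit integer)
RETURNS SETOF lexikon
LANGUAGE plpgsql STABLE AS
$$
DECLARE
    remaining integer := p_limit;
    fetched   integer;
    g         integer;
BEGIN
    FOR g IN REVERSE 5 .. 0 LOOP   -- assumes granules 0..5; tune to your data
        RETURN QUERY
            SELECT *
            FROM lexikon
            WHERE granule = g
              AND lset BETWEEN p_low AND p_high
            ORDER BY frequency DESC
            LIMIT remaining;
        GET DIAGNOSTICS fetched = ROW_COUNT;   -- rows just emitted by RETURN QUERY
        remaining := remaining - fetched;
        EXIT WHEN remaining <= 0;
    END LOOP;
END
$$;
```

The index leading on (granule DESC, lset) is what lets each iteration of the loop touch only the relevant band of the lset range.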
Results (timings should probably be taken with a pinch of salt, but each query is run twice to counter any caching), first using the function we've written, and then with a simple index scan.
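As a sketch, assuming a lexikon_search(low, high, limit) function as described and a covering index for the plain query (both names illustrative, parameter values arbitrary):

```sql
-- First, using the granule-stepping function:
EXPLAIN ANALYZE
SELECT word FROM lexikon_search(10000, 20000, 5);

-- Then the plain query with a simple covering index:
CREATE INDEX IF NOT EXISTS lexikon_lset_freq_word
    ON lexikon (lset, frequency DESC, word);
EXPLAIN ANALYZE
SELECT word
FROM lexikon
WHERE lset BETWEEN 10000 AND 20000
ORDER BY frequency DESC
LIMIT 5;
```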
Depending on your real-world data, you will probably want to vary the number of granules and the function used for assigning rows to them. The actual distribution of frequencies is key here, as are the expected values for the LIMIT clause and the size of the lset ranges sought.