This is a hard problem to solve in general, but there are a couple of things we can do to help the optimizer choose a plan. This script creates a table with 10,000 rows with a known pseudo-random distribution of rows to illustrate:
CREATE TABLE dbo.SomeDateTable
(
Id INTEGER IDENTITY(1, 1) PRIMARY KEY NOT NULL,
StartDate DATETIME NOT NULL,
EndDate DATETIME NOT NULL
);
GO
SET STATISTICS XML OFF;
SET NOCOUNT ON;
DECLARE
@i INTEGER = 1,
@s FLOAT = RAND(20120104),
@e FLOAT = RAND();
WHILE @i <= 10000
BEGIN
INSERT dbo.SomeDateTable
(
StartDate,
EndDate
)
VALUES
(
DATEADD(DAY, @s * 365, {d '2009-01-01'}),
DATEADD(DAY, @s * 365 + @e * 14, {d '2009-01-01'})
)
SELECT
@s = RAND(),
@e = RAND(),
@i += 1
END
The first question is how to index this table. One option is to provide two indexes on the DATETIME columns, so the optimizer can at least choose whether to seek on StartDate or EndDate.
CREATE INDEX nc1 ON dbo.SomeDateTable (StartDate, EndDate)
CREATE INDEX nc2 ON dbo.SomeDateTable (EndDate, StartDate)
Naturally, the inequalities on both StartDate and EndDate mean that only one column in each index can support a seek in the example query, but this is about the best we can do. We might consider making the second column in each index an INCLUDE rather than a key, but we might have other queries that can perform an equality seek on the leading column and an inequality seek on the second column. Also, we may get better statistics this way. Anyway...
DECLARE
@StartDateBegin DATETIME = {d '2009-08-01'},
@StartDateEnd DATETIME = {d '2009-10-15'},
@EndDateBegin DATETIME = {d '2009-08-05'},
@EndDateEnd DATETIME = {d '2009-10-22'}
SELECT
COUNT_BIG(*)
FROM dbo.SomeDateTable AS sdt
WHERE
sdt.StartDate BETWEEN @StartDateBegin AND @StartDateEnd
AND sdt.EndDate BETWEEN @EndDateBegin AND @EndDateEnd
This query uses variables, so in general the optimizer will guess at selectivity and distribution, resulting in a guessed cardinality estimate of 81 rows (consistent with the legacy 9% selectivity guess for each BETWEEN on unknown values: 10,000 × 0.09 × 0.09 = 81). In fact, the query produces 2076 rows, a discrepancy that might be important in a more complex example.
On SQL Server 2008 SP1 CU5 or later (or R2 RTM CU1) we can take advantage of the Parameter Embedding Optimization to get better estimates, simply by adding OPTION (RECOMPILE) to the SELECT query above. This causes a compilation just before the batch executes, allowing SQL Server to 'see' the real parameter values and optimize for those. With this change, the estimate improves to 468 rows (though you do need to check the runtime plan to see this). This estimate is better than 81 rows, but still not all that close. The modelling extensions enabled by trace flag 2301 may help in some cases, but not with this query.
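Concretely, that is just the same statement with the hint appended:

```sql
SELECT
COUNT_BIG(*)
FROM dbo.SomeDateTable AS sdt
WHERE
sdt.StartDate BETWEEN @StartDateBegin AND @StartDateEnd
AND sdt.EndDate BETWEEN @EndDateBegin AND @EndDateEnd
OPTION (RECOMPILE); -- embed the sniffed variable values at compile time
```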
The problem is where the rows qualified by the two range searches overlap. One of the simplifying assumptions made in the optimizer's costing and cardinality estimation component is that predicates are independent (so if both have a selectivity of 50%, the result of applying both is assumed to qualify 50% of 50% = 25% of the rows). Where this sort of correlation is a problem, we can often work around it with multi-column and/or filtered statistics. With two ranges with unknown start and end points, this becomes impractical. This is where we sometimes have to resort to rewriting the query to a form that happens to produce a better estimate:
SELECT COUNT(*) FROM
(
SELECT
sdt.Id
FROM dbo.SomeDateTable AS sdt
WHERE
sdt.StartDate BETWEEN @StartDateBegin AND @StartDateEnd
INTERSECT
SELECT
sdt.Id
FROM dbo.SomeDateTable AS sdt
WHERE
sdt.EndDate BETWEEN @EndDateBegin AND @EndDateEnd
) AS intersected (id)
OPTION (RECOMPILE)
This form happens to produce a runtime estimate of 2110 rows (versus 2076 actual). Unless you have TF 2301 on, in which case the more advanced modelling techniques see through the trick and produce exactly the same estimate as before: 468 rows.
One day SQL Server might gain native support for intervals. If that comes with good statistical support, developers might dread tuning query plans like this a little less.
I am assuming data type text for the relevant columns.
CREATE TABLE prefix (code text, name text, price int);
CREATE TABLE num (number text, time int);
"Simple" Solution
SELECT DISTINCT ON (1)
n.number, p.code
FROM num n
JOIN prefix p ON right(n.number, -1) LIKE (p.code || '%')
ORDER BY n.number, p.code DESC;
Key elements:
DISTINCT ON is a Postgres extension of the SQL standard DISTINCT. Find a detailed explanation of the query technique in this related answer on SO.
ORDER BY p.code DESC picks the longest match, because '1234' sorts after '123' (in ascending order).
Simple SQL Fiddle.
Without an index, the query would run for a very long time (I didn't wait for it to finish). To make this fast, you need index support. The trigram indexes you mentioned, supplied by the additional module pg_trgm, are good candidates. You have to choose between a GIN and a GiST index. The first character of the numbers is just noise and can be excluded from the index, which also makes it a functional index.
In my tests, a functional trigram GIN index won the race over a trigram GiST index (as expected):
CREATE INDEX num_trgm_gin_idx ON num USING gin (right(number, -1) gin_trgm_ops);
Advanced dbfiddle here.
All test results are from a local Postgres 9.1 test installation with a reduced setup: 17k numbers and 2k codes:
- Total runtime: 1719.552 ms (trigram GiST)
- Total runtime: 912.329 ms (trigram GIN)
Much faster yet
Failed attempt with text_pattern_ops
Once we ignore the distracting first noise character, it comes down to a basic left-anchored pattern match. Therefore I tried a functional B-tree index with the operator class text_pattern_ops (assuming column type text).
CREATE INDEX num_text_pattern_idx ON num(right(number, -1) text_pattern_ops);
This works excellently for direct queries with a single search term and makes the trigram index look bad in comparison:
SELECT * FROM num WHERE right(number, -1) LIKE '2345%'
- Total runtime: 3.816 ms (trgm_gin_idx)
- Total runtime: 0.147 ms (text_pattern_idx)
However, the query planner will not consider this index for joining two tables. I have seen this limitation before. I don't have a meaningful explanation for this, yet.
Partial / functional B-tree indexes
The alternative is to use equality checks on partial strings with partial indexes. These can be used in a JOIN.
Since we typically have only a limited number of different prefix lengths, we can build a solution similar to the one presented here with partial indexes.
Say, we have prefixes ranging from 1 to 5 characters. Create a number of partial functional indexes, one for every distinct prefix length:
CREATE INDEX prefix_code_idx5 ON prefix(code) WHERE length(code) = 5;
CREATE INDEX prefix_code_idx4 ON prefix(code) WHERE length(code) = 4;
CREATE INDEX prefix_code_idx3 ON prefix(code) WHERE length(code) = 3;
CREATE INDEX prefix_code_idx2 ON prefix(code) WHERE length(code) = 2;
CREATE INDEX prefix_code_idx1 ON prefix(code) WHERE length(code) = 1;
Since these are partial indexes, all of them together are barely larger than a single complete index.
Add matching indexes for numbers (taking the leading noise character into account):
CREATE INDEX num_number_idx5 ON num(substring(number, 2, 5)) WHERE length(number) >= 6;
CREATE INDEX num_number_idx4 ON num(substring(number, 2, 4)) WHERE length(number) >= 5;
CREATE INDEX num_number_idx3 ON num(substring(number, 2, 3)) WHERE length(number) >= 4;
CREATE INDEX num_number_idx2 ON num(substring(number, 2, 2)) WHERE length(number) >= 3;
CREATE INDEX num_number_idx1 ON num(substring(number, 2, 1)) WHERE length(number) >= 2;
While these indexes only hold a substring each and are partial, each covers most or all of the table. So they are much larger together than a single total index - except for long numbers. And they impose more work for write operations. That's the cost for amazing speed.
If that cost is too high for you (write performance is important / too many write operations / disk space an issue), you can skip these indexes. The rest is still faster, if not quite as fast as it could be ...
If numbers are never shorter than n characters, drop the redundant WHERE clauses from some or all of these indexes, and also drop the corresponding WHERE clauses from all following queries.
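For example, if every number is known to be at least three characters long (the noise character plus two digits; an assumption for illustration), the predicates on the two shortest indexes are implied and can go:

```sql
-- assumes length(number) >= 3 for every row, so the WHERE clauses are redundant
CREATE INDEX num_number_idx2 ON num(substring(number, 2, 2));
CREATE INDEX num_number_idx1 ON num(substring(number, 2, 1));
```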
Recursive CTE
With all the setup so far, I was hoping for a very elegant solution with a recursive CTE:
WITH RECURSIVE cte AS (
SELECT n.number, p.code, 4 AS len
FROM num n
LEFT JOIN prefix p
ON substring(number, 2, 5) = p.code
AND length(n.number) >= 6 -- incl. noise character
AND length(p.code) = 5
UNION ALL
SELECT c.number, p.code, len - 1
FROM cte c
LEFT JOIN prefix p
ON substring(number, 2, c.len) = p.code
AND length(c.number) >= c.len+1 -- incl. noise character
AND length(p.code) = c.len
WHERE c.len > 0
AND c.code IS NULL
)
SELECT number, code
FROM cte
WHERE code IS NOT NULL;
- Total runtime: 1045.115 ms
However, while this query isn't bad - it performs about as well as the simple version with a trigram GIN index - it doesn't deliver what I was aiming for. The recursive term is planned only once, so it cannot use the best indexes. Only the non-recursive term can.
UNION ALL
Since we are dealing with a small number of recursions, we can just spell them out iteratively. This allows an optimized plan for each of them. (We lose the recursive exclusion of already matched numbers, though, so there is still some room for improvement, especially for a wider range of prefix lengths):
SELECT DISTINCT ON (1) number, code
FROM (
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 5) = p.code
AND length(n.number) >= 6 -- incl. noise character
AND length(p.code) = 5
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 4) = p.code
AND length(n.number) >= 5
AND length(p.code) = 4
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 3) = p.code
AND length(n.number) >= 4
AND length(p.code) = 3
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 2) = p.code
AND length(n.number) >= 3
AND length(p.code) = 2
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 1) = p.code
AND length(n.number) >= 2
AND length(p.code) = 1
) x
ORDER BY number, code DESC;
- Total runtime: 57.578 ms (!!)
A breakthrough, finally!
SQL function
Wrapping this into an SQL function removes the query planning overhead for repeated use:
CREATE OR REPLACE FUNCTION f_longest_prefix()
RETURNS TABLE (number text, code text) LANGUAGE sql AS
$func$
SELECT DISTINCT ON (1) number, code
FROM (
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 5) = p.code
AND length(n.number) >= 6 -- incl. noise character
AND length(p.code) = 5
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 4) = p.code
AND length(n.number) >= 5
AND length(p.code) = 4
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 3) = p.code
AND length(n.number) >= 4
AND length(p.code) = 3
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 2) = p.code
AND length(n.number) >= 3
AND length(p.code) = 2
UNION ALL
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(number, 2, 1) = p.code
AND length(n.number) >= 2
AND length(p.code) = 1
) x
ORDER BY number, code DESC
$func$;
Call:
SELECT * FROM f_longest_prefix();
- Total runtime: 17.138 ms (!!!)
PL/pgSQL function with dynamic SQL
This plpgsql function is much like the recursive CTE above, but the dynamic SQL with EXECUTE forces the query to be re-planned for every iteration. Now it makes use of all the tailored indexes.
Additionally, this works for any range of prefix lengths. The function takes two parameters for the range, but I declared them with DEFAULT values, so it works without explicit parameters, too:
CREATE OR REPLACE FUNCTION f_longest_prefix2(_min int = 1, _max int = 5)
RETURNS TABLE (number text, code text) LANGUAGE plpgsql AS
$func$
BEGIN
FOR i IN REVERSE _max .. _min LOOP -- longer matches first
RETURN QUERY EXECUTE '
SELECT n.number, p.code
FROM num n
JOIN prefix p
ON substring(n.number, 2, $1) = p.code
AND length(n.number) >= $1+1 -- incl. noise character
AND length(p.code) = $1'
USING i;
END LOOP;
END
$func$;
The final step cannot easily be wrapped into the same function.
Either just call it like this:
SELECT DISTINCT ON (1)
number, code
FROM f_longest_prefix2() x
ORDER BY number, code DESC;
Or use another SQL function as wrapper:
CREATE OR REPLACE FUNCTION f_longest_prefix3(_min int = 1, _max int = 5)
RETURNS TABLE (number text, code text) LANGUAGE sql AS
$func$
SELECT DISTINCT ON (1)
number, code
FROM f_longest_prefix2($1, $2) x
ORDER BY number, code DESC
$func$;
Call:
SELECT * FROM f_longest_prefix3();
A bit slower due to the required planning overhead, but more versatile than the SQL function, and shorter for wider ranges of prefix lengths.
Best Answer
If you use the >= and < operators, you can return the required range by "increasing" the second argument by 1: technically, replacing the last character ('5') with the next character ('6') in the character set. Below is a query that returns all the rows in your range of A010 to A025. To get this, I pass 'A026' as the second argument. The expression < 'A026' includes the value 'A0259456546'.
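As a sketch with hypothetical names (table tbl, column col), such a half-open range predicate looks like this:

```sql
-- tbl and col are placeholder names for illustration
SELECT *
FROM tbl
WHERE col >= 'A010'  -- inclusive lower bound
  AND col <  'A026'; -- exclusive upper bound, so 'A0259456546' still qualifies
```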