I am writing a table valued function to geocode an address based on its Suburb, State and Postcode. I try to geocode the address using different methods, in order of decreasing accuracy:
- Exact match on unique suburb-postcode-state combination
- Exact match on unique suburb-postcode combination
- Exact match on unique suburb-state combination
- Exact match on unique postcode
- Approximate match by non-unique Postcode, where all Suburbs with this Postcode are within 5 km of one another.
(I am working with a geographical region where Suburb-Postcode-State relationships are all many-to-many. In other words, one Suburb can have multiple postcodes; one postcode can have multiple suburbs and may exist in different states.)
Following is an extract from the table-valued function:
ALTER FUNCTION [geocode].[tvfn_Customer_Suburb_From_Address]
(
@Suburb NVARCHAR(100),
@State NVARCHAR(100),
@Postcode NVARCHAR(100),
@Country NVARCHAR(100)
)
RETURNS TABLE
AS
RETURN
(
SELECT TOP 1 *
FROM (
-- Unique suburb-postcode-state combinations
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 4 -- Exact match by unique Postcode, Suburb and State
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE s.[Is_Active] = 1
AND s.[Suburb] = @Suburb
AND s.[State] = @State
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
UNION ALL
-- -- Unique suburb-postcode combinations
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 3 -- Exact match by unique Postcode & Suburb
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE EXISTS ( SELECT *
FROM geocode.tSuburbs_XX
WHERE Is_Active = 1
AND Suburb = s.Suburb AND Postcode = s.Postcode
GROUP BY Postcode, Suburb
HAVING COUNT(*) = 1
)
AND s.Is_Active = 1
AND s.[Suburb] = @Suburb
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
UNION ALL
-- Exact match by unique Suburb and State
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 6 -- Exact match by unique Suburb and State
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE EXISTS ( SELECT *
FROM geocode.tSuburbs_XX
WHERE Is_Active = 1 AND Is_PO_Box = 0 -- Exclude PO Boxes
AND Suburb = s.Suburb AND Postcode = s.Postcode
GROUP BY Suburb, Postcode
HAVING COUNT(*) = 1
)
AND s.Is_Active = 1
AND s.[Suburb] = @Suburb
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
UNION ALL
-- Exact match by unique Postcode
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 2 -- Exact match by unique Postcode
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE EXISTS ( SELECT *
FROM geocode.tSuburbs_XX
WHERE Is_Active = 1
AND Postcode = s.Postcode
GROUP BY Postcode
HAVING COUNT(*) = 1
)
AND s.Is_Active = 1
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
-- Perform this extra check to make sure we don't match a postcode in a wrong country
AND ( @Country IN ('AAA', 'BBB', 'CCC')
OR @State IN ('MMM', 'NNN', 'OOO', 'PPP')
)
UNION ALL
-- Approximate match by non-unique Postcode, where all Suburbs with this Postcode are within 5 km of one another.
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 5
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM [geocode].[tPostcode_Distances] pd
INNER JOIN geocode.tSuburbs_XX s
ON pd.Approx_Suburb_DID = s.Suburb_DID
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE s.Is_Active = 1
AND pd.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
-- Perform this extra check to make sure we don't match a postcode in a wrong country
AND ( @Country IN ('AAA', 'BBB', 'CCC')
OR @State IN ('MMM', 'NNN', 'OOO', 'PPP')
)
AND pd.Max_Distance <= 5000 -- within 5 km
) t
)
The above function works, but I would like to know if it can be improved. In particular, is it possible to force SQL Server to stop processing the SELECT
statements after the first SELECT
that returns a result-set (because we're interested in the first matching result only – TOP 1
)?
Update
Thank you for suggestions thus far. I will add a [Priority]
column as suggested in several answers and comments and an ORDER BY
clause to ensure I get the top result.
I will also add a WITH SCHEMABINDING
to the TVFN so SQL Server can parallelise the plan. While we're on this subject, the using a multi-statement table-valued function is a good idea (Thank you to Paul White), but multi-statement TVFN's always force a serial plan.
I will now try Lennart's answer, in which he suggested using a CTE.
Best Answer
If you must use a single query (as required by a single inline function), you can use one of the two options below (illustrated in my recent answer to Relating 2 tables with possible wildcards?):
Option 1
Use multiple
APPLY
clauses with a startup condition for each using an outer reference from a previous apply in the chain. The efficiency of this method depends on the presence of startup filters in the execution plan. Correct results are guaranteed, but plan shape is not.Option 2
Add an extra column with a constant literal to each clause of the union e.g.
[Priority] = 1
then add anORDER BY [Priority] ASC
at theTOP (1)
scope. Efficient operation depends on the plan avoiding sorts.On reflection, this is not what you want in this case, because one row from each option is required by the merge concatenation in the plan. Nevertheless, it is an option in more general situations (where the alternate inputs produce more than one row, and the first row at low cost).
In addition:
Option 3
Since you're only returning a single row, you could use a multi-statement table-valued function instead, with explicit logic to try each option in order (in separate queries), returning as soon as the first result is found. This is guaranteed to produce correct results efficiently.
Note
The current function is tecnhically nondeterministic; SQL Server could evaluate the union all in any order it chooses, potentially returning a lower-priority result before evaluating a higher priority one.