Sql-server – SQL Server: query performance (search 2 million rows)

database-designsql-server-2005

Got an interesting one for all of you SQL guru's out there. Now this search is only taking a couple of seconds, but it's quite intensive and there must be a better way. Maybe I'm expecting too much?

Simple holiday search app. 2 million holidays. Paging/Sorting around 600,000 rows.

This is the schema of the table

CREATE TABLE [dbo].[Holiday](
        [Id] [int] NOT NULL,
        [PropertyId] [int] NOT NULL,
        [Price] [int] NOT NULL,
        [Rating] [int] NOT NULL,
        [Country] [char](2) NOT NULL,
        [ResortId] [int] NOT NULL,
        [DepartureAirport] [char](3) NOT NULL,
        [DestinationAirport] [char](3) NOT NULL,
        [DepartureDate] [datetime] NOT NULL,
        [Basis] [char](2) NOT NULL,
        [Duration] [int] NOT NULL,

     CONSTRAINT [PK_Holiday] PRIMARY KEY CLUSTERED ([Id] ASC)
  )

As you can see, quite simple. We have a property, a price, duration, departure/destination airports, etc. Now, the more field that are provided the faster the search. If I have a Departure Airport, Property and Date then the search is very fast. However, if I just have a Country and nothing else, there is a lot of data to work through.

Using this CSV export of my table, there are 2 million rows in total, and around 666k just with a country code of FR, which is my example.

This, is the search query. Which returns two tables. The first is a summary, so total number of holidays that match your criteria and how many unique properties. The second table contains the actual results from the search.

--Build a temp table, and store everything we need in it
CREATE TABLE #Pricing (PropertyId int, Duration int, HolidayId int, Rating int, Price int, StartDate datetime, PropertyRow int);

INSERT INTO #Pricing
  SELECT 
    PropertyId, Duration, [Id], [Rating], [Price], DepartureDate,
    ROW_NUMBER() OVER (PARTITION BY PropertyId ORDER BY Price ASC) as PropertyRow
  FROM 
    dbo.Holiday
  WHERE 
    DepartureDate > GETDATE() AND Country = 'FR'

--Get a total number of holidays, and total number of properties
SELECT 
    COUNT(*) AS TotalHolidaysCount, 
    COUNT(DISTINCT PropertyId) AS PropertyCount
FROM 
    #Pricing

--Build the final table, which will contain all the holidays we actually want to return
DECLARE @FinalResults TABLE (HolidayId int, RowNumber int);

INSERT INTO 
    @FinalResults
  SELECT 
    HolidayId, RowNumber
  FROM
    (SELECT 
         PropertyRow, HolidayId, 
         ROW_NUMBER() OVER (order by (CASE WHEN StartDate <= '01/Apr/2013' THEN 1 ELSE 0 END) ASC, [Price] ASC) as RowNumber
     FROM  
        #Pricing 
     WHERE 
        PropertyRow = 1) as SearchResults
WHERE 
    (RowNumber > (10 * (1 - 1)) and RowNumber <= (1 * 10))
ORDER BY 
    RowNumber;

SELECT
     *
FROM
    @FinalResults
    INNER JOIN dbo.Holiday ON HolidayId = Holiday.Id

DROP TABLE #Pricing

Now, I could look into Indexing which would obviously improve performance. But what worries me, is the incredible use of Temporary tables. Surely this shouldn't be how it's done? Taking 5 seconds to search through what is ultimately a tiny amount of data. They only reason they are used is because of the need to reference the data later.

Would it maybe be worth running the query twice, instead of storing all of the data in memory? It seems a waste to select over 25% of the table into memory again and again.

Any helpful feedback would be appreciated. Not looking for the 'answer', just some assistance.

Many thanks,
Dean

Best Answer

It is very difficult to accurately suggest a solution with no ability to run testing, or see how the DB is indexed etc. But I'm going to try anyway.

You ideally need to find a balance, if your query is likely to return a lot of data and run quickly then I would err towards running the query on the main table twice, if it is likely to take a long time and return a relatively small number of rows then I would stick with the temp table approach.

Given the information you have provided in the question it appears as though there is no problem with the speed of the select, in which case I would tend to agree with you, the additional cost of inserting into a temp table, then selecting from the temp table is as much overhead as performing the select query twice. The queries can be simplified as follows:

SELECT  COUNT(*) [TotalHolidaysCount],
        COUNT(DISTINCT PropertyID) [PropertyCount]
FROM    dbo.Holiday
WHERE   DepartureDate > GETDATE() 
AND     Country = 'FR'


SELECT  *
FROM    (   SELECT  *, ROW_NUMBER() OVER (ORDER BY(CASE WHEN StartDate <= '01/Apr/2013' THEN 1 ELSE 0 END) ASC, [Price] ASC) [RowNumber]
            FROM    (   SELECT  h.*, ROW_NUMBER(PARTITION BY PropertyID, ORDER BY Price ASC) [PropertyRow]
                        FROM    dbo.Holiday
                        WHERE   DepartureDate > GETDATE() 
                        AND     Country = 'FR'
                    ) h
            WHERE   PropertyRow = 1
        ) h
WHERE   Rownumber BETWEEN 1 AND 10

If you do find this is a performance hit and you would rather cache the results of the select I'd be inclined to use table variables instead of temp tables, and only store the Holiday ID primary key, and join back to holiday on the fast indexed join of primary key = primary key as follows:

DECLARE @Results TABLE (ID INT NOT NULL PRIMARY KEY)
INSERT @Results
SELECT  ID
FROM    dbo.Holiday
WHERE   DepartureDate > GETDATE() 
AND     Country = 'FR'

SELECT  COUNT(*) [TotalHolidaysCount],
        COUNT(DISTINCT PropertyID) [PropertyCount]
FROM    dbo.Holiday h
        INNER JOIN @Results r
            ON r.ID = h.ID

SELECT  *
FROM    (   SELECT  *, ROW_NUMBER() OVER (ORDER BY(CASE WHEN StartDate <= '01/Apr/2013' THEN 1 ELSE 0 END) ASC, [Price] ASC) [RowNumber]
            FROM    (   SELECT  h.*, ROW_NUMBER(PARTITION BY PropertyID, ORDER BY Price ASC) [PropertyRow]
                        FROM    dbo.Holiday h
                                INNER JOIN @Results r
                                    ON r.ID = h.ID
                    ) h
            WHERE   PropertyRow = 1
        ) h
WHERE   Rownumber BETWEEN 1 AND 10

This way you are caching as little data as possible (one column of integers), while still retaining enough data to do a fast indexed search on dbo.Holiday.

It ultimately has to come down to looking at your execution plans, creating proper indexes and testing various approaches to find the one that best suits you.