SQL Server – Optimal way to record and retrieve a time-dependent value

database-design, sql-server, sql-server-2014

I have a table that has transactions for a bus (boarding riders). Given the route ID and the date, I need to look up in another table what service type it was doing that day. Bus schedules change at most every 6 months or so, with most going years unchanged.

Currently the schedule table is defined like so:

CREATE TABLE [dbo].[Routes](
    [ID] [int] NOT NULL,
    [RouteID] [int] NOT NULL,
    [Type] [varchar](50) NOT NULL,
    [StartDate] [datetime] NOT NULL,
PRIMARY KEY CLUSTERED 
(
    [ID] ASC
));

An example might look like:

ID  RouteID  Type          StartDate
--  -------  ------------  ----------
 1      301  Standard      2015-01-01
 2      301  Discontinued  2016-06-01
 3      302  Standard      2015-01-01
 4      302  ParaTrans     2017-01-01

So, if I have a transaction from 2015-04-20 for RouteID 301, I want to get back "Standard", but if the transaction is from 2018-01-20, it should return "Discontinued". For transactions prior to 2015-01-01, it should return NULL (or "", or anything else that cannot be confused with a valid answer such as "Standard", "ParaTrans", or "Discontinued").

Basically, the table represents that route 301 was a standard route between 2015-01-01 and 2016-05-31 (and thus any transactions during that period should be categorized as "Standard"), then it was Discontinued on 2016-06-01 (through current day, implicitly as there is no later schedule change noted), while 302 was a Standard route from 2015-01-01 through 2016-12-31, then a ParaTrans(it) route after.

Route   Type          Start       End
-----   ----          -----       ---
301
        Standard      2015-01-01  2016-05-31
        Discontinued  2016-06-01  Present
302
        Standard      2015-01-01  2016-12-31
        ParaTrans     2017-01-01  Present
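The lookup rule described above ("latest StartDate on or before the transaction date") can be sketched outside the database. This is only an illustration in Python using the four sample rows; the names `build_index` and `service_type` are made up for the example and are not part of any schema here.

```python
from bisect import bisect_right
from datetime import date

# Sample rows from the question: (RouteID, StartDate, Type).
ROUTES = [
    (301, date(2015, 1, 1), "Standard"),
    (301, date(2016, 6, 1), "Discontinued"),
    (302, date(2015, 1, 1), "Standard"),
    (302, date(2017, 1, 1), "ParaTrans"),
]

def build_index(rows):
    """Group rows by route; each list ends up sorted by StartDate ascending."""
    index = {}
    for route_id, start, kind in sorted(rows):
        index.setdefault(route_id, []).append((start, kind))
    return index

def service_type(index, route_id, day):
    """Return the type in effect on `day`, or None if no row applies yet."""
    history = index.get(route_id, [])
    starts = [start for start, _ in history]
    pos = bisect_right(starts, day)  # number of rows with StartDate <= day
    return history[pos - 1][1] if pos else None

INDEX = build_index(ROUTES)
print(service_type(INDEX, 301, date(2015, 4, 20)))   # Standard
print(service_type(INDEX, 301, date(2018, 1, 20)))   # Discontinued
print(service_type(INDEX, 301, date(2014, 12, 31)))  # None
```

The `None` result corresponds to the NULL wanted for transactions that predate the first schedule row.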

Currently, the query to do this looks like this:

SELECT
    TRANSIT_DAY, 
    ROUTE_ID, 
    (SELECT TOP (1) Type FROM Routes
     WHERE (RouteID = dbo.DAILY_SALES_DETAIL.ROUTE_ID) 
     AND (StartDate <= dbo.DAILY_SALES_DETAIL.TRANSIT_DAY)
     ORDER BY StartDate DESC) AS NCTD_MODE 
FROM dbo.DAILY_SALES_DETAIL

Questions

What I'd like to know is: Is this the most effective combination of (a) structure of the Routes table and (b) query to achieve this result? In other words, could a more efficient query be used with the existing structure? Could a change to the route table allow for a more efficient query?

Considerations

The table of transactions is imported from a vendor on a daily basis, so changing the schema of that table is not trivial and I would prefer to avoid it. More importantly, this lookup is used across a number of tables and databases using transactions or other bus-related data sourced from multiple vendors; this is just a single example. We have one vendor (and thus one database) for monetary transactions, another for rider count, and still another for performance, and so on, with the route number and date being the only reliably consistent identifier across them all.

The Routes table has an index on (RouteID, StartDate). Currently there are 56 rows in the Routes table and 26M rows in the transaction table. The Routes table covers 45 routes, and currently no route has more than 2 rows (that is, more than one change). There is no limit to how many changes a single route could have, but I include this stat to show that the number is likely to remain small for the foreseeable future.

I can add any required indexes in order to optimize a suggested query. The question is more about finding the best strategy, assuming all reasonable optimizations are made to the considered strategies, than about finding the best optimization of a particular strategy.

db<>fiddle here

Best Answer

You could increase performance of your setup, as shown in your question, by changing the dbo.Routes table to:

CREATE TABLE dbo.Routes(
      RouteID int NOT NULL
    , [Type] varchar(50) NOT NULL
    , StartDate datetime NOT NULL
    , CONSTRAINT PK_Routes
        PRIMARY KEY CLUSTERED
        (RouteID, StartDate DESC)
) WITH (DATA_COMPRESSION = PAGE)
ON [PRIMARY];

The key here is that the clustered index, which is the table itself, is defined on the compound key (RouteID, StartDate DESC). This stores the data in exactly the order that is most efficient for the query you've written. The caveat is that inserting a row with a newer date for an existing route can cause page splits, because each new row sorts ahead of that route's existing rows in descending date order. Having said that, with so few rows in the Routes table, and with occasional index maintenance, this should not be a big concern.

Instead of doing that, I would consider modifying the dbo.Routes table to include an EndDate column. This eliminates the need to do a subquery with TOP(1) and ORDER BY .... Something like:

CREATE TABLE dbo.Routes(
      RouteID int NOT NULL
    , [Type] varchar(50) NOT NULL
    , StartDate datetime NOT NULL
    , EndDate datetime NOT NULL
    , CONSTRAINT PK_Routes
        PRIMARY KEY CLUSTERED
        (RouteID, StartDate ASC)
);

Note the clustered index is now on (RouteID, StartDate ASC).

The query can now use an INNER JOIN, instead of the correlated subquery, and looks like:

SELECT
      t.TRANSIT_DAY
    , t.ROUTE_ID
    ,  NCTD_MODE = r.Type 
FROM Transactions t
    INNER JOIN dbo.Routes r ON t.ROUTE_ID = r.RouteID 
        AND t.TRANSIT_DAY >= r.StartDate 
        AND t.TRANSIT_DAY < r.EndDate
ORDER BY t.TRANSIT_DAY
    , t.ROUTE_ID;

This allows SQL Server to use a simple nested loops join to obtain the results. Granted, if you are returning a large number of rows, a significant sort will be required, and it will probably spill to tempdb.
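One behavioural difference may be worth checking before switching to the join: the correlated subquery returns NULL for a transaction dated before the route's first StartDate, whereas an INNER JOIN on the date range drops that row entirely. The sketch below models both behaviours in Python on a few hand-made rows. `FAR_FUTURE` is an assumed sentinel for "still current", and the function names are illustrative only.

```python
from datetime import date

FAR_FUTURE = date(9999, 12, 31)  # assumed sentinel for an open-ended period

# (RouteID, Type, StartDate, EndDate); EndDate is exclusive, as in the join.
ROUTES = [
    (301, "Standard", date(2015, 1, 1), date(2016, 6, 1)),
    (301, "Discontinued", date(2016, 6, 1), FAR_FUTURE),
]

# (TRANSIT_DAY, ROUTE_ID)
TRANSACTIONS = [
    (date(2014, 12, 31), 301),  # before the first StartDate
    (date(2015, 4, 20), 301),
    (date(2016, 6, 1), 301),    # first day of the second period
]

def inner_join(transactions, routes):
    """Keep only transactions that fall inside some [StartDate, EndDate)."""
    return [
        (day, route_id, kind)
        for day, route_id in transactions
        for rid, kind, start, end in routes
        if rid == route_id and start <= day < end
    ]

def left_join(transactions, routes):
    """Like inner_join, but keep unmatched transactions with type None."""
    out = []
    for day, route_id in transactions:
        kinds = [k for rid, k, s, e in routes if rid == route_id and s <= day < e]
        out.extend((day, route_id, k) for k in (kinds or [None]))
    return out

print(len(inner_join(TRANSACTIONS, ROUTES)))  # 2
print(left_join(TRANSACTIONS, ROUTES)[0][2])  # None
```

If the NULL result matters, the SQL counterpart of `left_join` would be a LEFT JOIN with the same three predicates.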

Using the MCVE shown below, we can compare plans for the two variants. The first plan is your original query with the correlated subquery. The second plan is with the EndDate column included.

The 2nd variant has a plan cost which is about 4 times lower than the 1st variant. The sort operators in both plans are requesting 108MB of memory and spilling over 9,000 pages to tempdb - however it's fairly unlikely you'll be requesting the entire result set vs getting a single route, or perhaps a date range. If you add a filter for a single route, there is no large memory grant or spill to tempdb.

What follows is a sample MCVE with 10,000 route rows, and 1,000,000 transaction rows, that can be used to run tests against various designs:

Do this in tempdb to avoid any "accidents" with real tables.

USE tempdb;

Drop the tables if they exist (this works on SQL Server 2016+):

DROP TABLE IF EXISTS dbo.Routes;
DROP TABLE IF EXISTS dbo.Transactions;

Create the dbo.Routes table, with a clustered index on RouteID, StartDate DESC:

CREATE TABLE dbo.Routes(
      RouteID int NOT NULL
    , [Type] varchar(50) NOT NULL
    , StartDate datetime NOT NULL
    , CONSTRAINT PK_Routes
        PRIMARY KEY CLUSTERED
        (RouteID, StartDate DESC)
);

Insert 10,000 route rows (1,000 distinct RouteID values, each with 10 consecutive start dates):

;WITH src AS (
    SELECT t.n
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))t(n)
)
, src2 AS (
SELECT RouteID = (s1.n * 1000) + (s2.n * 100) + (s3.n * 10)
    , Type = REPLICATE(CHAR(65 + CONVERT(int, CRYPT_GEN_RANDOM(1) % 26)), 50)
FROM src s1
    CROSS JOIN src s2
    CROSS JOIN src s3
    CROSS JOIN src s4
)
INSERT INTO dbo.Routes (RouteID, [Type], StartDate)
SELECT s.RouteID
    , s.Type
    , StartDate = DATEADD(DAY, ROW_NUMBER() OVER (PARTITION BY RouteID ORDER BY s.RouteID) - 1, '1997-01-01T00:00:00')
FROM src2 s
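To see what the INSERT above produces, the cross join can be replayed in a few lines of Python. This is only a sanity check of the row counts, not part of the test script; `generate_route_ids` is a name made up for the illustration.

```python
from collections import Counter
from itertools import product

DIGITS = range(10)

def generate_route_ids():
    """Mimic the four-way CROSS JOIN of the digits 0-9 (s1..s4)."""
    # s4 does not appear in the formula, so it only multiplies the row count by 10.
    return [s1 * 1000 + s2 * 100 + s3 * 10
            for s1, s2, s3, _s4 in product(DIGITS, repeat=4)]

route_ids = generate_route_ids()
rows_per_route = Counter(route_ids)

print(len(route_ids))                # 10000
print(len(rows_per_route))           # 1000
print(set(rows_per_route.values()))  # {10}
print(750 in rows_per_route)         # True
```

So each RouteID gets 10 rows, which ROW_NUMBER() then spreads over 10 consecutive days starting at 1997-01-01.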

Create the dbo.Transactions table, with a clustered index on (ROUTE_ID, TRANSIT_DAY). Building the clustered index like that optimizes queries that filter on both route and day.

CREATE TABLE dbo.Transactions(
     TRANSIT_DAY datetime NOT NULL
    , ROUTE_ID int NOT NULL
    , CONSTRAINT PK_Transactions
        PRIMARY KEY CLUSTERED
        (ROUTE_ID, TRANSIT_DAY)
);

Insert 1,000,000 rows into the dbo.Transactions table:

;WITH src AS (
    SELECT t.n
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))t(n)
)
INSERT INTO dbo.Transactions (TRANSIT_DAY, ROUTE_ID)
SELECT DATEADD(DAY, CONVERT(int, CRYPT_GEN_RANDOM(1)), '1997-01-01') + DATEADD(MILLISECOND, ABS(CONVERT(int, CRYPT_GEN_RANDOM(4))), '00:00:00')
    , r.RouteID
FROM dbo.Routes r
CROSS JOIN src s1
CROSS JOIN src s2

For a Routes table with an EndDate column that can be used for comparative tests, I used this:

CREATE TABLE dbo.RoutesEndDate(
      RouteID int NOT NULL
    , [Type] varchar(50) NOT NULL
    , StartDate datetime NOT NULL
    , EndDate datetime NOT NULL
    , CONSTRAINT PK_RoutesEndDate
        PRIMARY KEY CLUSTERED
        (RouteID, StartDate ASC)
);

INSERT INTO dbo.RoutesEndDate (RouteID, [Type], StartDate, EndDate)
SELECT r.RouteID
    , r.[Type]
    , r.StartDate
    , EndDate = COALESCE(LEAD(r.StartDate) OVER (PARTITION BY r.RouteID ORDER BY r.StartDate), GETDATE())
FROM dbo.Routes r;
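The EndDate derivation above takes each row's EndDate from the next StartDate of the same route, with a fallback for the latest row. A rough Python equivalent, assuming rows shaped as (RouteID, Type, StartDate) and a caller-supplied `open_end` value in place of GETDATE(), might look like this; `with_end_dates` is an illustrative name.

```python
from datetime import date
from itertools import groupby

def with_end_dates(rows, open_end):
    """rows: (RouteID, Type, StartDate). Returns the rows with EndDate appended."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    out = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        # Pair each row with the next row's StartDate; the last row gets open_end.
        next_starts = [r[2] for r in group[1:]] + [open_end]
        for (route_id, kind, start), end in zip(group, next_starts):
            out.append((route_id, kind, start, end))
    return out

SAMPLE = [
    (301, "Standard", date(2015, 1, 1)),
    (301, "Discontinued", date(2016, 6, 1)),
    (302, "Standard", date(2015, 1, 1)),
    (302, "ParaTrans", date(2017, 1, 1)),
]

for row in with_end_dates(SAMPLE, open_end=date(9999, 12, 31)):
    print(row)
```

Because each EndDate equals the following StartDate, the range is half-open, which matches the `>= StartDate AND < EndDate` predicates in the join.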

Query both tables for a specific route:

SELECT
      t.TRANSIT_DAY
    , t.ROUTE_ID
    ,  NCTD_MODE = (
        SELECT TOP (1) Type 
        FROM Routes r
        WHERE (r.RouteID = t.ROUTE_ID) AND (r.StartDate <= t.TRANSIT_DAY)
        ORDER BY r.StartDate DESC
        ) 
FROM Transactions t
WHERE t.ROUTE_ID = 750
ORDER BY t.TRANSIT_DAY
    , t.ROUTE_ID;

The plan for the above query:

I/O and time Stats:

Table 'Routes'. Scan count 1000, logical reads 2142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Transactions'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 2 ms,  elapsed time = 2 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

Query for all transactions/routes:

SELECT
      t.TRANSIT_DAY
    , t.ROUTE_ID
    ,  NCTD_MODE = (
        SELECT TOP (1) Type 
        FROM Routes r
        WHERE (r.RouteID = t.ROUTE_ID) AND (r.StartDate <= t.TRANSIT_DAY)
        ORDER BY r.StartDate DESC
        ) 
FROM Transactions t
ORDER BY t.TRANSIT_DAY
    , t.ROUTE_ID;

The plan:

Nasty spill to tempdb for sort operator:

If we modify the clustered index on dbo.Transactions to be (TRANSIT_DAY, ROUTE_ID), and re-run the full query, we see a plan without that ugly sort and spill-to-tempdb:

Related Solutions

Bus time schedule database design

Since you really only have to keep track of the times between stations on each route, you only need to store the start time of each route. The rest can be calculated easily by storing a time delta for each line stop (the time between the current station and the previous one), instead of storing time data for each route_stop. You also need to maintain the order of the stops on the line. If it's a circular route, you simply put the stops into the chain twice with different ordering numbers (so each stop on a circular route is inserted twice into the route_station relation table with a different order number).

You can of course keep the time for each stop if you want, but that seems redundant and makes it harder to add stops to a route, since you'd then have to recalculate all bus stop times instead of simply adding the new stop and updating the time delta of the next stop.

I'd probably start with a data model like this (but of course it needs to be expanded if you want to add information about the buses, drivers, etc.):

lines (id, name, ...)

routes (id, name, line_id, ...)

stops (id, location)

line_stops (id, line_id, stop_id, order, time_delta)

route_start_times (id, route_id, start_time)
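As a small illustration of the delta idea, the arrival time at each stop is the route's start time plus the running sum of the deltas, taken in stop order. The rows below are hypothetical values shaped like (stop_id, order, time_delta in minutes) from line_stops, and `stop_times` is a name invented for this sketch.

```python
from datetime import datetime, timedelta
from itertools import accumulate

# Hypothetical line_stops rows for one line: (stop_id, order, time_delta_minutes).
# The line is circular, so stop 10 appears twice with different order numbers.
LINE_STOPS = [(11, 2, 4), (10, 1, 0), (12, 3, 7), (10, 4, 5)]

def stop_times(start_time, line_stops):
    """Return (stop_id, arrival_time) pairs in stop order."""
    ordered = sorted(line_stops, key=lambda s: s[1])
    offsets = accumulate(delta for _, _, delta in ordered)  # running total of minutes
    return [
        (stop_id, start_time + timedelta(minutes=offset))
        for (stop_id, _, _), offset in zip(ordered, offsets)
    ]

for stop_id, when in stop_times(datetime(2024, 1, 1, 8, 0), LINE_STOPS):
    print(stop_id, when.strftime("%H:%M"))
# 10 08:00
# 11 08:04
# 12 08:11
# 10 08:16
```

Adding a stop then means inserting one row and adjusting the delta of the stop that follows it; no stored timetable has to be rewritten.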

Storing bus routes in a database

For all business analysis leading to database architecture, I recommend writing rules:

A route has 2 or more stations
A Station can be used by many routes
Stations on a route come in a specific order

The 1st and 2nd rules, as you noticed, imply a many-to-many relationship, so you rightly concluded that a routeStations table is needed.

The 3rd rule is the interesting one. It implies that an extra column is needed to fit the requirement. Where should it go? We can see that this property depends on Route AND Station. Therefore it should be located in routeStations.

I would add a column to table routeStations called "stationOrder".

+-------------+---------------+--------------+
| routeId(fk) | stationId(fk) | stationOrder |
+-------------+---------------+--------------+
|     1       |       1       |       3      |
+-------------+---------------+--------------+
|     1       |       3       |       1      |
+-------------+---------------+--------------+
|     1       |       4       |       2      |
+-------------+---------------+--------------+
|     2       |       1       |       1      |
+-------------+---------------+--------------+
|     2       |       4       |       2      |
+-------------+---------------+--------------+

Then querying becomes easy:

select rs.routeID,s.Name
from routeStations rs
join
Stations s
on rs.stationId=s.StationId
where rs.routeId=1
order by rs.StationOrder;

+-------------+---------------+
| routeId     | name          |
+-------------+---------------+
|     1       |       C       |
+-------------+---------------+
|     1       |       D       |
+-------------+---------------+
|     1       |       A       |
+-------------+---------------+
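The same join-and-sort can be traced by hand on the sample rows. The Python sketch below uses the station names A to D (IDs 1 to 4) and the five routeStations rows from this answer; `stations_in_order` is just an illustrative helper.

```python
# stationId -> name, and (routeId, stationId, stationOrder) rows from the example.
STATIONS = {1: "A", 2: "B", 3: "C", 4: "D"}
ROUTE_STATIONS = [
    (1, 1, 3),
    (1, 3, 1),
    (1, 4, 2),
    (2, 1, 1),
    (2, 4, 2),
]

def stations_in_order(route_id):
    """Return the station names of one route, sorted by stationOrder."""
    rows = [
        (order, station_id)
        for rid, station_id, order in ROUTE_STATIONS
        if rid == route_id
    ]
    return [STATIONS[station_id] for _, station_id in sorted(rows)]

print(stations_in_order(1))  # ['C', 'D', 'A']
print(stations_in_order(2))  # ['A', 'D']
```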

Notes:

1. I fixed the StationId in RouteStations in my example. You are using the StationName as the Id.
2. If you don't use a route name, then there's not even a need for a routes table, since you can get the routeId from routeStations.
3. Even if you did join to the routes table, your database optimizer would notice it doesn't need that extra join and simply remove the extra steps.

To expand on note 3, I've built a test case:

This is Oracle 12c Enterprise.

Note in the execution plan below that table routes isn't used at all. The Cost-Based Optimizer (CBO) knows it can get the routeId directly from the routeStations primary key (step 5, INDEX RANGE SCAN on ROUTESTATIONS_PK; Predicate Information: 5 - access("RS"."ROUTEID"=1)).

--Table ROUTES
create sequence routeId_Seq start with 1 increment by 1 maxvalue 9999999999999 cache 1000;

CREATE TABLE routes
(
  routeId  INTEGER NOT NULL
);


ALTER TABLE routes ADD (
  CONSTRAINT routes_PK
  PRIMARY KEY
  (routeId)
  ENABLE VALIDATE);

insert into routes values (routeId_Seq.nextval);
insert into routes values (routeId_Seq.nextval);
commit;

--TABLE STATIONS  
create sequence stationId_seq start with 1 increment by 1 maxvalue 9999999999999 cache 1000;

create table stations(
   stationID INTEGER NOT NULL,
   name varchar(50) NOT NULL
);

ALTER TABLE stations ADD (
  CONSTRAINT stations_PK
  PRIMARY KEY
  (stationId)
  ENABLE VALIDATE);

insert into stations values (stationId_seq.nextval,'A');
insert into stations values (stationId_seq.nextval,'B');
insert into stations values (stationId_seq.nextval,'C');
insert into stations values (stationId_seq.nextval,'D');
commit;
--

--Table ROUTESTATIONS 
CREATE TABLE routeStations
(
  routeId       INTEGER NOT NULL,
  stationId     INTEGER NOT NULL,
  stationOrder  INTEGER NOT NULL
);


ALTER TABLE routeStations ADD (
  CONSTRAINT routeStations_PK
  PRIMARY KEY
  (routeId, stationId)
  ENABLE VALIDATE);

ALTER TABLE routeStations ADD (
  FOREIGN KEY (routeId) 
  REFERENCES ROUTES (ROUTEID)
  ENABLE VALIDATE,
  FOREIGN KEY (stationId) 
  REFERENCES STATIONS (stationId)
  ENABLE VALIDATE);

insert into routeStations values (1,1,3);
insert into routeStations values (1,3,1);
insert into routeStations values (1,4,2);
insert into routeStations values (2,1,1);
insert into routeStations values (2,4,2);
commit;

explain plan for select rs.routeID,s.Name
from ndefontenay.routeStations rs
join
ndefontenay.routes r
on r.routeId=rs.routeId
join ndefontenay.stations s
on rs.stationId=s.stationId
where rs.routeId=1
order by rs.StationOrder;

set linesize 1000
set pages 500
select * from table (dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2617709240

---------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                  |     1 |    79 |     1 (100)| 00:00:01 |
|   1 |  SORT ORDER BY                 |                  |     1 |    79 |     1 (100)| 00:00:01 |
|   2 |   NESTED LOOPS                 |                  |       |       |            |          |
|   3 |    NESTED LOOPS                |                  |     1 |    79 |     0   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID| ROUTESTATIONS    |     1 |    39 |     0   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | ROUTESTATIONS_PK |     1 |       |     0   (0)| 00:00:01 |
|*  6 |     INDEX UNIQUE SCAN          | STATIONS_PK      |     1 |       |     0   (0)| 00:00:01 |
|   7 |    TABLE ACCESS BY INDEX ROWID | STATIONS         |     1 |    40 |     0   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("RS"."ROUTEID"=1)
   6 - access("RS"."STATIONID"="S"."STATIONID")

Now the fun part: let's add a "name" column to the routes table, so that there is a column we actually need from routes. The CBO now uses the index to find the rowID for route 1, then accesses the table (TABLE ACCESS BY INDEX ROWID) and grabs the column routes.name.

ALTER TABLE ROUTES
 ADD (name  VARCHAR2(50));

update routes set name='Old Town' where routeId=1;
update routes set name='North County' where routeId=2;
commit;

explain plan for select r.name as routeName,s.Name as stationName
from routeStations rs
join
routes r
on r.routeId=rs.routeId
join stations s
on rs.stationId=s.stationId
where rs.routeId=1
order by rs.StationOrder;

set linesize 500
set pages 500
select * from table (dbms_xplan.display);

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------
Plan hash value: 3368128430

----------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                  |     1 |   119 |     1 (100)| 00:00:01 |
|   1 |  SORT ORDER BY                  |                  |     1 |   119 |     1 (100)| 00:00:01 |
|   2 |   NESTED LOOPS                  |                  |       |       |            |          |
|   3 |    NESTED LOOPS                 |                  |     1 |   119 |     0   (0)| 00:00:01 |
|   4 |     NESTED LOOPS                |                  |     1 |    79 |     0   (0)| 00:00:01 |
|   5 |      TABLE ACCESS BY INDEX ROWID| ROUTES           |     1 |    40 |     0   (0)| 00:00:01 |
|*  6 |       INDEX UNIQUE SCAN         | ROUTES_PK        |     1 |       |     0   (0)| 00:00:01 |
|   7 |      TABLE ACCESS BY INDEX ROWID| ROUTESTATIONS    |     1 |    39 |     0   (0)| 00:00:01 |
|*  8 |       INDEX RANGE SCAN          | ROUTESTATIONS_PK |     1 |       |     0   (0)| 00:00:01 |
|*  9 |     INDEX UNIQUE SCAN           | STATIONS_PK      |     1 |       |     0   (0)| 00:00:01 |
|  10 |    TABLE ACCESS BY INDEX ROWID  | STATIONS         |     1 |    40 |     0   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - access("R"."ROUTEID"=1)
   8 - access("RS"."ROUTEID"=1)
   9 - access("RS"."STATIONID"="S"."STATIONID")