Nested one to many relationships

database-design

I'm working on a "Bus stations" database. I want to design the next relationship:
"A bus line has many stops (one stop belongs to one line) and one stop has many arrival times (one arrival time belongs to one stop)".

So my design is like this:

Lines (1,1)<->(1,N) Stops (1,1)<->(1,N) Times

Lines(id, name, ...)
Stops(id, line_id, name, ...)
Times(id, stop_id, ...)

But when I was testing it, I realized that I needed a FOREIGN KEY line_id in Times table (because I could not distinguish between times of different bus lines). This transforms the diagram:

Lines (1,1)<->(1,N) Stops (1,1)<->(1,N) Times
  |                                       |
  |                                       | 
   ---------------------------------------
  (1,1)                                  (1,N)

I'm a little bit confused because I think this design has redundancy: the relationship between Lines and Times could not be reached by the intermediate relationships?

Best Answer

You should not need the extra FK in Times as that information can be derived from the existing relationships, but you will need to JOIN in the other tables to get at the extra property. Adding the extra key like that is sometimes a necessary optimisation but it does break "normal form" as you are duplicating data (meaning that either your business layer becomes responsible for maintaining the referential integrity of that duplicate, or you need to do the same in the database using triggers and other "powerful but be very careful with them" features of your chose database).

Your select * from times where stop_id = 1 and line_id = 1 should be something like:

SELECT times.id, times.stopid, ...
FROM   times
JOIN   stops ON times.stop_id=stops.id
WHERE  times.stop_id=1 AND stops.line_id=1

To simplify queries you can create views that abstract out the underlying structure a little, meaning you can keep the data in best form while dealing with it as it it did have the extra columns with duplicated information.

I'm not sure I'd model a timetable store that way though. I would assume that stops are separate entities with a many-to-many link between lines (unless you are counting a stop at which the no. 4, 5, and 45x lines stop at as three separate stops even though they are physically the same location). Also I'd probably want to group the lines together so that all the times for the "no 4" service are identifiable as a related collection but you can distinguish between the arrival times of each (so you can ask "what is the arrival time at X for the service that leaves Y at HH:MM" and so forth. Of course I could just be misreading your intended model! I'm thinking something like:

                                       TimedStop
                   TimedRoute          =============
Line               =============       ts_id (PK)         Stop
============       route_id (PK) ===>  route_id (FK)      =============
line_id (PK) ===>  line_id (FK)        stop_id (FK)  <=== stop_id (PK)
line_name                              arrival_time       stop_name
                                       depart_time        stop_location

Here TimedStop becomes a many-to-many relationship store for timed routes and stops. Of course there are probably several perfectly valid ways to model such a system depending on your other constraints, and the above may be rendered an incorrect model when you consider the fuller design spec (without more detail of what you are trying to model and what outputs are desired this is not possible to completely pin down, I made a fair few assumptions, which may or may not be correct, when throwing the above diagram together).

As another aside on a point of style: select * is generally best avoided in permanent code where possible. If you are instead specific about what you want out this because part of your API and other code using this output has more guarantees of what columns will be returned, even if the underlying structures are updated. It can also allow the query planner/runner to apply extra optimisations (potentially avoiding extra heap lookups or being able to replace a table scan with an index scan, and so forth).