Let's say you made the table like this
CREATE TABLE TrainEvents
(
id INT NOT NULL AUTO_INCREMENT,
station_arr_id INT NOT NULL,
station_dep_id INT NOT NULL,
arr_dt DATETIME NOT NULL,
dep_dt DATETIME NOT NULL,
...
PRIMARY KEY (id),
KEY station_arr_index (station_arr_id,arr_dt),
KEY station_dep_index (station_dep_id,dep_dt)
);
Next, you insert into the table the event
- Arriving from Station 21 on April 30, 2014
- Arrived at Station 29 at 10:00 AM on April 30, 2014
- Departing from Station 29 one minute later
Here is that INSERT query
INSERT INTO TrainEvents (station_arr_id,arr_dt,station_dep_id,dep_dt)
VALUES (21,'2014-04-30 10:00:00',29,'2014-04-30 10:01:00');
OK, now let's see how you can query for the following departures
- All departures from Station 29 around 10:00 AM on April 30, 2014
- Checking departures 15 minutes before and after
Here is that SELECT query
SET @MinuteWindow = 15;
SET @DepartureDT = '2014-04-30 10:00:00';
SET @DepartureDTMinBef = @DepartureDT - INTERVAL @MinuteWindow MINUTE;
SET @DepartureDTMinAft = @DepartureDT + INTERVAL @MinuteWindow MINUTE;
SELECT * FROM TrainEvents
WHERE station_dep_id = 29
AND dep_dt >= @DepartureDTMinBef
AND dep_dt <= @DepartureDTMinAft;
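The windowed lookup above can be sketched end to end in Python, using the stdlib sqlite3 module for illustration (the original schema is MySQL; table and column names follow it, and the window arithmetic is done in Python rather than with MySQL's INTERVAL syntax):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TrainEvents (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        station_arr_id INTEGER NOT NULL,
        station_dep_id INTEGER NOT NULL,
        arr_dt TEXT NOT NULL,
        dep_dt TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO TrainEvents (station_arr_id, arr_dt, station_dep_id, dep_dt) "
    "VALUES (21, '2014-04-30 10:00:00', 29, '2014-04-30 10:01:00')"
)

# Same 15-minute window on either side of the target departure time
window = timedelta(minutes=15)
dep = datetime(2014, 4, 30, 10, 0, 0)
fmt = "%Y-%m-%d %H:%M:%S"

rows = conn.execute(
    "SELECT * FROM TrainEvents "
    "WHERE station_dep_id = ? AND dep_dt BETWEEN ? AND ?",
    (29, (dep - window).strftime(fmt), (dep + window).strftime(fmt)),
).fetchall()
print(len(rows))  # the 10:01 departure falls inside the window
```

Because the datetime strings are zero-padded in YYYY-MM-DD HH:MM:SS order, plain string comparison in BETWEEN sorts them chronologically.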
EPILOGUE
Given your initial idea, you could change the station_arr_id and station_dep_id into a single route_id and store the routes in a route table
CREATE TABLE TrainRoutes
(
route_id INT NOT NULL AUTO_INCREMENT,
route_name VARCHAR(128) NOT NULL,
...
PRIMARY KEY (route_id),
KEY route_name_ndx (route_name)
);
CREATE TABLE TrainEvents
(
id INT NOT NULL AUTO_INCREMENT,
route_id INT NOT NULL,
arr_dt DATETIME NOT NULL,
dep_dt DATETIME NOT NULL,
...
PRIMARY KEY (id),
KEY arr_index (route_id,arr_dt),
KEY dep_index (route_id,dep_dt)
);
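With the single route_id design, finding departures means one range scan plus a join back to the lookup table. Here is a minimal sketch using Python's sqlite3 for illustration; the lookup-table column names (route_id, route_name) and the sample route name are assumptions for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TrainRoutes (
        route_id INTEGER PRIMARY KEY AUTOINCREMENT,
        route_name TEXT NOT NULL
    );
    CREATE TABLE TrainEvents (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        route_id INTEGER NOT NULL,
        arr_dt TEXT NOT NULL,
        dep_dt TEXT NOT NULL
    );
    INSERT INTO TrainRoutes (route_name) VALUES ('Station 21 -> Station 29');
    INSERT INTO TrainEvents (route_id, arr_dt, dep_dt)
        VALUES (1, '2014-04-30 10:00:00', '2014-04-30 10:01:00');
""")

# Departures in a 30-minute window, resolved to a human-readable route name
rows = conn.execute("""
    SELECT r.route_name, e.dep_dt
    FROM TrainEvents e
    JOIN TrainRoutes r ON r.route_id = e.route_id
    WHERE e.dep_dt BETWEEN '2014-04-30 09:45:00' AND '2014-04-30 10:15:00'
""").fetchall()
```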
You could also make the event record arrivals and departures separately.
CREATE TABLE TrainRoutes
(
route_id INT NOT NULL AUTO_INCREMENT,
route_name VARCHAR(128) NOT NULL,
...
PRIMARY KEY (route_id),
KEY route_name_ndx (route_name)
);
CREATE TABLE TrainEvents
(
id INT NOT NULL AUTO_INCREMENT,
route_id INT NOT NULL,
event_dt DATETIME NOT NULL,
event_type TINYINT NOT NULL, # 1 for arrival, 2 for departure
...
PRIMARY KEY (id),
KEY event_index1 (route_id,event_dt,event_type),
KEY event_index2 (route_id,event_type,event_dt)
);
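With arrivals and departures stored as separate rows, a departures-only lookup filters on event_type first, which is exactly the shape the (route_id,event_type,event_dt) index serves. A small sketch, again using Python's sqlite3 for illustration of the MySQL design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TrainEvents (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        route_id INTEGER NOT NULL,
        event_dt TEXT NOT NULL,
        event_type INTEGER NOT NULL  -- 1 = arrival, 2 = departure
    );
    CREATE INDEX event_index2 ON TrainEvents (route_id, event_type, event_dt);
    INSERT INTO TrainEvents (route_id, event_dt, event_type) VALUES
        (1, '2014-04-30 10:00:00', 1),
        (1, '2014-04-30 10:01:00', 2);
""")

# Only departures (event_type = 2) on route 1 inside the window
deps = conn.execute("""
    SELECT event_dt FROM TrainEvents
    WHERE route_id = 1 AND event_type = 2
      AND event_dt BETWEEN '2014-04-30 09:45:00' AND '2014-04-30 10:15:00'
""").fetchall()
```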
I am just giving ideas; I'll leave the implementation to you. Perhaps, to give you some UI ideas, see the site I use to commute: http://as0.mta.info/mnr/schedules/sched_form.cfm . The results for the route you pick on that site show departures within a 5-hour window (2.5 hours before and 2.5 hours after).
Give it a Try !!!
Let's break this question up into a few parts.
Q: I need to insert 1mm rows a day. Is that a lot?
Not really. 1mm divided by 24 hours, divided by 60 minutes, divided by 60 seconds gives you about 12 inserts per second. For a rough sense of perspective, it's not unusual to see 1,000 inserts per second on typical commodity servers with no tuning.
Granted, your load won't be perfectly averaged out like that - you'll have bursts of load - but I wouldn't make database platform decisions based on less than 10k-20k inserts per second. Any platform out there will work fairly well.
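The back-of-the-envelope rate above, written out:

```python
# 1mm rows per day, spread evenly over a day's worth of seconds
rows_per_day = 1_000_000
seconds_per_day = 24 * 60 * 60  # 86,400
rate = rows_per_day / seconds_per_day
print(round(rate, 1))  # about 11.6 inserts/second on average
```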
Q: How should I structure the data?
Zoom out - don't think tables, think databases. If you're going to be keeping this data permanently, and it's truly insert-only with no updates, then you probably want to start a new database per time period. Your inserts may only go into one table in one database, but every year, create a new database (MyApp_2015) and seal the old 2014 data as read-only. You can stop backing it up (as long as you've still got one good final backup), stop doing index maintenance, statistics updates, etc.
The PHP will only ever have to know about the current database for inserts, making your design a lot easier. The archival process becomes a DBA task much later down the road as long as you go in knowing that there will be more than one database involved.
If you were doing more than 1,000 inserts per second sustained, and you wanted easier performance management, then I'd also suggest building sharding into the initial design regardless of the database platform. Don't get me wrong, any modern database can handle over 1,000 inserts per second, but designing sharding in now just gives you more flexibility later on. At 12 inserts per second, it's just not worth the design/testing hassle.
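The "PHP only knows the current database" idea boils down to deriving the insert target from the clock. A minimal sketch, assuming the MyApp_<year> naming convention suggested above:

```python
from datetime import date

def current_insert_database(today: date) -> str:
    """Name of the only database the application writes to."""
    return f"MyApp_{today.year}"

print(current_insert_database(date(2015, 3, 1)))  # MyApp_2015
```

Older databases (MyApp_2014 and earlier) never appear in the insert path; they become a read-only archival concern for the DBA.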
Q: How should I do reporting?
In an ideal world, reports would not be done against the live server. Run the reports against a restored or replicated copy of the database. This does two things: it reduces load on the live server, and it validates your backups, guaranteeing that you've got your valuable data elsewhere.
Best Answer
For optimal storage and later searching, I would consider using two tables.
The first table holds the event/schedule definition (frequency, interval, recurrences, etc.), with one row per event. The second table is maintained by a process that runs over the first table, calculating and inserting rows for future occurrences of each event.
Based on your system requirements, you can decide how far ahead the second table should be filled.
This makes it easy to later query for events running within a specific time frame.
This solution should be more elegant than trying to calculate the date range on the fly, which is more complex and definitely slower.
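The materialization step can be sketched in a few lines. This is a hedged illustration, not a full scheduler: the function name and fields (start, interval, horizon) are assumptions standing in for whatever the schedule-definition table stores.

```python
from datetime import datetime, timedelta

def expand_occurrences(start, interval, horizon):
    """Materialize concrete occurrences of a recurring event,
    from its start time up to (and including) the horizon."""
    occurrences = []
    t = start
    while t <= horizon:
        occurrences.append(t)
        t += interval
    return occurrences

# A train departing every 30 minutes, expanded two hours ahead:
occ = expand_occurrences(
    datetime(2014, 4, 30, 10, 0),
    timedelta(minutes=30),
    datetime(2014, 4, 30, 12, 0),
)
print(len(occ))  # 5 occurrences: 10:00, 10:30, 11:00, 11:30, 12:00
```

Each returned timestamp would become one row in the second table, so time-window queries reduce to a plain indexed range scan instead of recurrence math at query time.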