Mysql – normalization of data and duplicate rows

MySQL

All right, this is a hobby question, so nothing critical if I don't get a quick response. But you guys have been so good at helping me, I thought why not chance it again.

I am creating a db that can track late arrivals in a public transport system, and this means joining two tables (or more), and I've now being 'doing' MySQL for about ten years, and every single time, I get frustrated that I just do not seem to be able to get my head around how to manage more than one table, and I drop it for a year or so, before coming back and thinking, 'OK.. .this time, I'll crack it!'. Maybe I'm too damned stupid, and should stick to MS Access or something.

So, four tables. For the purposes of clarity, so that you can understand what I'm trying to do, I'll give you their full descriptions..

mysql> describe users ; 
+---------------+-------------+------+-----+---------+----------------+
| Field         | Type        | Null | Key | Default | Extra          |
+---------------+-------------+------+-----+---------+----------------+
| users_id      | int(11)     | NO   | PRI | NULL    | auto_increment |
| email_address | varchar(32) | NO   |     | NULL    |                |
| first_name    | varchar(16) | YES  |     | NULL    |                |
| second_name   | varchar(24) | YES  |     | NULL    |                |
+---------------+-------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)

mysql> describe stations ; 
+---------------------+-------------+------+-----+---------+----------------+
| Field               | Type        | Null | Key | Default | Extra          |
+---------------------+-------------+------+-----+---------+----------------+
| stations_id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| stations_short_name | varchar(6)  | NO   |     | NULL    |                |
| stations_long_name  | varchar(36) | NO   |     | NULL    |                |
+---------------------+-------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql> describe delays ;
+-----------------+---------+------+-----+---------+----------------+
| Field           | Type    | Null | Key | Default | Extra          |
+-----------------+---------+------+-----+---------+----------------+
| delays_id       | int(11) | NO   | PRI | NULL    | auto_increment |
| delayed_journey | int(11) | NO   |     | NULL    |                |
| minutes_delayed | int(11) | NO   |     | NULL    |                |
+-----------------+---------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql> describe journeys ;
+-----------------------+------------+------+-----+---------------------+-----------------------------+
| Field                 | Type       | Null | Key | Default             | Extra                       |
+-----------------------+------------+------+-----+---------------------+-----------------------------+
| journeys_id           | int(11)    | NO   | PRI | NULL                | auto_increment              |
| journeys_orig_station | int(11)    | NO   | MUL | NULL                |                             |
| journeys_dest_station | int(11)    | NO   | MUL | NULL                |                             |
| was_delayed           | tinyint(1) | NO   |     | NULL                |                             |
| start_time            | timestamp  | NO   |     | CURRENT_TIMESTAMP   | on update CURRENT_TIMESTAMP |
| end_time              | timestamp  | NO   |     | 0000-00-00 00:00:00 |                             |
| delay                 | int(11)    | YES  | MUL | NULL                |                             |
+-----------------------+------------+------+-----+---------------------+-----------------------------+
7 rows in set (0.00 sec)

For the moment, to test it, I put two stations into stations:

mysql> select * from stations; 
+-------------+---------------------+----------------------+
| stations_id | stations_short_name | stations_long_name   |
+-------------+---------------------+----------------------+
|           1 | abc                 | ABC Central          |
|           2 | xyz                 | XYZ Provincial       |
+-------------+---------------------+----------------------+
2 rows in set (0.00 sec)

Let's say I took a train this morning from ABC Central to XYZ Provincial, and that it was delayed by six minutes…

mysql> select * 
       from journeys ;
+-------------+-----------------------+-----------------------+-------------+---------------------+---------------------+-------+
| journeys_id | journeys_orig_station | journeys_dest_station | was_delayed | start_time          | end_time            | delay |
+-------------+-----------------------+-----------------------+-------------+---------------------+---------------------+-------+
|           2 |                     1 |                     2 |           1 | 2013-12-20 08:01:00 | 2013-12-20 08:36:00 |     6 |
+-------------+-----------------------+-----------------------+-------------+---------------------+---------------------+-------+
1 row in set (0.00 sec)

So far, so good. One journey. Now, say I want to get the station names and other stuff in there. There should be one row, shouldn't there? Because I only took one journey, yeah?

mysql> SELECT journeys.journeys_id, journeys.journeys_orig_station, 
              journeys.journeys_dest_station, journeys.delay 
       FROM journeys, stations 
       WHERE journeys.journeys_orig_station = '1' 
          AND journeys.was_delayed='1' ;

+-------------+-----------------------+-----------------------+-------+
| journeys_id | journeys_orig_station | journeys_dest_station | delay |
+-------------+-----------------------+-----------------------+-------+
|           2 |                     1 |                     2 |     6 |
|           2 |                     1 |                     2 |     6 |
+-------------+-----------------------+-----------------------+-------+
2 rows in set (0.00 sec)

Two rows. Two freakin' rows.

This has now been haunting me for the past ten years. No matter how many times I try to learn MySQL, every single time I try to use data from two different tables, but only include each row once, I get duplicate rows. So I use DISTINCT, and this works – for now. No doubt, as soon as I start to use JOIN, then it'll all go pear-shaped again, and I'll toss the O'Reilly book into the corner for another ten years.

mysql> SELECT DISTINCT journeys.journeys_id, journeys.journeys_orig_station, 
                       journeys.journeys_dest_station, journeys.delay 
       FROM journeys, stations 
       WHERE journeys.journeys_orig_station = '1' 
         AND journeys.was_delayed='1' ;

+-------------+-----------------------+-----------------------+-------+
| journeys_id | journeys_orig_station | journeys_dest_station | delay |
+-------------+-----------------------+-----------------------+-------+
|           2 |                     1 |                     2 |     6 |
+-------------+-----------------------+-----------------------+-------+
1 row in set (0.00 sec)

But how do I get the names of the stations in there, rather than their ID? And why is the default for start time set to CURRENT_TIMESTAMP? And is having delays actually in the journeys table a good idea? Is there any way to get the tables better organized?

Best Answer

You haven't specified any join term between journeys and stations. You need something like this:

SELECT
    J.journeys_id, S1.stations_short_name AS OriginStation, S2.stations_short_name AS DestinationStation
FROM
    journeys AS J
    INNER JOIN stations AS S1 ON J.journeys_orig_station = S1.stations_id
    INNER JOIN stations AS S2 ON J.journeys_dest_station = S2.stations_id
WHERE
        J.journeys_id = 1
    AND J.was_delayed = 1