Conclusion and a workaround
After exhausting all options on Windows, I decided to switch to Linux, mostly because I was frustrated with the inability to profile and debug in detail.
I moved the whole setup to Ubuntu 14.04. I first tried XAMPP but gave up because of conflicts between XAMPP, MySQL, and MySQL Workbench. Then I moved to vanilla MySQL (5.5, I think) and vanilla Apache 2.
However, I was still left with the same problem: no visible bottleneck, and resources still underutilized. I suspected throttling in the TCP sockets used between the Perl code and MySQL, but further profiling proved this was not the case.
Then I turned my attention to the Perl DBI module and its DBD::mysql driver, thinking that it might be doing some throttling. I ran tests in which I replaced the DBI calls in my Perl code with system calls (system("mysql -e 'INSERT INTO blah blah …'")). The performance did not change, absolving DBI as the culprit.
I need to add one important detail now: I was in fact always running a number of my Perl scripts in parallel. Given that the CPU has 8 cores, this is of course necessary to utilize all of them.
Further debugging showed that almost all my Perl processes, which were supposed to be working furiously, were sleeping most of the time. The Ubuntu System Monitor showed them as waiting on the wait channels wait_answer_interruptible or unix_stream_recvmsg. The CPU History graph in System Monitor showed all the perl processes jumping to 100% CPU utilization and then dropping to ~0% in unison. I suspected that the MySQL server was not configured for multithreading, but htop showed 17 active mysqld threads, so threading itself looked fine.
I suspected that all the MySQL threads were waiting on the same semaphore and were locked out most of the time. I dreaded delving into the dark bowels of MySQL to figure out what was going on inside. Instead, I decided to replace MySQL with MariaDB, even though MariaDB seemed to have had the same issue back when I was running it on Windows.
Lo and behold: this finally worked. My Perl scripts were screaming.
One last problem remained: my method of parallelising the Perl scripts was very rudimentary. I would just run 10 or 20 of them with their respective workloads and hope they would utilize all the resources.
This has obvious drawbacks. If too many processes are spawned, the OS may spend too much time switching between them (not a serious issue with only 20 processes, but it becomes one with, say, 1000). If too few are spawned (fewer than 8, one per core), the CPU certainly will not be fully utilized. And if the processes together exhaust RAM, Linux will start swapping to disk; as soon as that happens, everything grinds to a halt.
I searched for a Perl library or script that would spawn new processes only when CPU, memory and disk are underutilized, but could not find one, so I wrote my own: raspawn.pl (resource-aware spawn), which I have put on GitHub. raspawn.pl spawns a number of processes while trying to keep resource utilization just below the maximum: it continuously checks CPU, memory and disk utilization and starts a new process only when all three are below ~90%.
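The gating logic can be sketched as follows. This is a hypothetical Python rendering of the idea, not the actual raspawn.pl (which is written in Perl); the function names and the CPU-only sampling are my own simplifications:

```python
import os
import subprocess
import time

MAX_UTIL = 0.90  # start new work only while everything is below ~90%

def ok_to_spawn(cpu, mem, disk, cap=MAX_UTIL):
    """Pure decision rule: spawn only if every resource is under the cap."""
    return cpu < cap and mem < cap and disk < cap

def cpu_util():
    """1-minute load average normalized by core count (Unix/Linux)."""
    load1, _, _ = os.getloadavg()
    return load1 / os.cpu_count()

def spawn_when_idle(commands):
    """Launch each command only once resources dip below the cap."""
    procs = []
    for cmd in commands:
        # The real raspawn.pl also samples memory and disk utilization;
        # only the CPU check is shown here.
        while not ok_to_spawn(cpu_util(), 0.0, 0.0):
            time.sleep(1)  # back off and re-sample
        procs.append(subprocess.Popen(cmd))
    return procs
```

The key design point is separating the measurement from the decision: the decision rule is a pure function, so the spawner never starts work while any resource is already near saturation.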
Finally, this worked. I can now process my whole load in around 7 days, instead of many months...
OK, I ended up adding another lookup table:
CREATE TABLE IF NOT EXISTS `stops_routes` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`stop_id` varchar(100) NOT NULL,
`route_id` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `stop_route` (`stop_id`,`route_id`),
KEY `stop_id` (`stop_id`),
KEY `route_id` (`route_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Filling it was fairly fast:
mysql> insert into stops_routes (stop_id, route_id)
->
-> select
-> s.stop_id,
-> r.route_id as from_route_id
->
-> from routes r
-> left join trips t on t.route_id = r.route_id
-> left join stop_times st on st.trip_id = t.trip_id
-> left join stops s on s.stop_id = st.stop_id
-> group by s.stop_id, r.route_id;
Query OK, 3496 rows affected (8.38 sec)
Records: 3496 Duplicates: 0 Warnings: 0
Using it is blazingly fast:
mysql> select
-> r.route_id as from_route_id,
-> c_sr.route_id as to_route_id
->
-> from routes r
->
-> left join stops_routes sr on sr.route_id = r.route_id
-> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
-> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
->
-> where r.route_id <> c_sr.route_id
-> group by r.route_id, c_sr.route_id
-> limit 10;
+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
| 0001 | 0002 |
| 0001 | 0003 |
| 0001 | 0004 |
| 0001 | 0005 |
| 0001 | 0006 |
| 0001 | 0008 |
| 0001 | 0009 |
| 0001 | 0011 |
| 0001 | 0014 |
| 0001 | 0031 |
+---------------+-------------+
10 rows in set (0.63 sec)
Now I can fill my last lookup table (the set of connections between all routes on my GTFS network):
mysql> insert into route_connections (from_route_id, to_route_id)
-> select
-> r.route_id as from_route_id,
-> c_sr.route_id as to_route_id
->
-> from routes r
->
-> left join stops_routes sr on sr.route_id = r.route_id
-> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
-> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
->
-> where r.route_id <> c_sr.route_id
-> group by r.route_id, c_sr.route_id;
Query OK, 2848 rows affected (0.31 sec)
Records: 2848 Duplicates: 0 Warnings: 0
Amazingly fast. I guess the engine couldn't break the original query into these intermediate steps to optimize it on its own.
I'd still be interested to know whether it is possible to get the same result (the route-to-route connections table) using only one sub-second, or even sub-minute, query.
Best Answer
I suspect the combined query is not doing what you think it is doing.
Add parentheses to make it clear whether the second ORDER BY belongs to the second SELECT or to the UNION. Ditto for the LIMIT.
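For example, each branch of a UNION can be parenthesized so it is unambiguous which SELECT an ORDER BY and LIMIT apply to (t1, t2 and the columns here are placeholder names, not the asker's tables):

```sql
-- Each branch keeps its own ORDER BY / LIMIT inside parentheses;
-- the trailing ORDER BY / LIMIT apply to the combined result.
(SELECT a, b FROM t1 ORDER BY a LIMIT 10)
UNION ALL
(SELECT a, b FROM t2 ORDER BY a LIMIT 10)
ORDER BY a
LIMIT 10;
```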
For further discussion, please provide