Mysql improve Self join performance large table

join;mysql-5.5performance

I have a task to help identify conflicting IP ranges , I want to highlight when one companies IP range falls within another's … below is the sample table,

I have converted the IP ranges from .dot notation to their equivalent integer values (corp1_start… corp1_end):

+-----------+---------------------------------+--------------------+-------------+-------------+------------+------------+
| client_id | corp1                           | corp2              | corp1_start | corp2_start | corp1_end  | corp2_end  |
+-----------+---------------------------------+--------------------+-------------+-------------+------------+------------+
|      4263 | AlTel Lucent                    | EverGreen          |   212451424 |   212451456 |  212451439 |  212451583 |
|       308 | AlTel Lucent                    | Alcatel Lucent 1   |  2265448448 |  2279342080 | 2265710591 | 2281701375 |
|       308 | AlTel Lucent                    | AT&T 09 Corporate  |  2265448448 |  2277834752 | 2265710591 | 2278621183 |
|       308 | AlTel Lucent                    | AT&T 02 Corporate  |  2265448448 |  2278948864 | 2265710591 | 2279014399 |
|      2488 | Vassar College                  | Walmart            |  2414149632 |  2399207424 | 2414215167 | 2401828863 |
|      1417 | US House of Representatives     | Walmart            |  2414280704 |  2399207424 | 2414346239 | 2401828863 |
|      4861 | IP Test 2                       | IP Test 1          |  2418813975 |  2418813952 | 2418813976 | 2418814207 |
|      4484 | EXXon Corp                      | Chevron            |  2464874496 |  2450915328 | 2464940031 | 2452553727 |
+-----------+---------------------------------+--------------------+-------------+-------------+------------+------------+

And I have a query like this that use's a SELF join to help fetch the conflicting IP ranges.:

SELECT t1.client_id,t1.corp1,'IP conflicts --->',
t1.corp2, t1.corp1_start, t2.corp2_start, t1.corp1_end,t2.corp2_end
FROM table2 AS t1
JOIN table2 AS t2
 on 
 ( 
 t1.corp1_start >= t2.corp2_start 
 AND  t1.corp1_end <= t2.corp2_end 
 AND t1.client_id <> t2.client_id 
 ) 

the actual table has over 20,000 rows with indices on both start and end ranges, and the above query never seems to end . With less than 5000 rows the query appears to run , but as soon as you get into the 15K+ rows, it becomes unresponsive..

Running EXPLAIN SELECT on the query above produces something like :

+----+-------------+-------+------+----------------+------+---------+------+-------+------------------------------------------------+
| id | select_type | table | type | possible_keys  | key  | key_len | ref  | rows  | Extra                                          |
+----+-------------+-------+------+----------------+------+---------+------+-------+------------------------------------------------+
|  1 | SIMPLE      | t2    | ALL  | corp2_ip_idx   | NULL | NULL    | NULL | 16951 |                                                |
|  1 | SIMPLE      | t1    | ALL  | corp1_ip_idx | NULL | NULL    | NULL | 16951 | Range checked for each record (index map: 0x2) |
+----+-------------+-------+------+----------------+------+---------+------+-------+------------------------------------------------+

Any suggestions on how to improve performance? Version of database is MySQL 5, table type is myISAM.. thanks…

Best Answer

Here's a stab at this: see fiddle at http://sqlfiddle.com/#!9/1fc60/1. My algorithm in general to take all the start/end values for all clients, sort those numbers and make each adjacent pair of values into an interval. For n clients, you have 2n-1 intervals (in worst case). Then, evaluate each of those intervals to see how many clients' IP ranges they fall into. If more than 1, then you found a conflicting IP range between two clients. It does works, but as you get bigger data sets, it gets considerably slower. The algorithm is Polynomial in time complexity. On modest hardware, I was able to process 16k clients for conflicts in about 6 minutes.

I think there's a much better algorithm that might take logarithmic time (good), or even linear time (way better), but this algo would be pretty tough to do in MySQL or any RDBMS. It would be much easier in an OO 3GL like Java or .NET.