MySQL join on hamming distance

Tags: mysql, join

I have a table t1 with two columns: col1 holds every possible string of length 13 over the alphabet {1,2,3}, and col2 holds some integer X. The table therefore has 3^13 = 1,594,323 rows in total.
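To make the setup concrete, here is a small sketch (names are mine, not from the question) that enumerates exactly the key set of t1 and confirms the row count:

```python
from itertools import product

# Every string of length 13 over the alphabet {'1','2','3'}
# is one row key of t1: 3^13 = 1,594,323 strings.
ALPHABET = "123"
LENGTH = 13

def all_strings(alphabet=ALPHABET, length=LENGTH):
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

count = sum(1 for _ in all_strings())
print(count)  # 1594323
```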

For each string I need to find all other strings at a Hamming distance of exactly, say, 3 and sum their col2 values. Each string has 2288 (C(13,3) × 2^3 = 286 × 8) other strings at Hamming distance 3.
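The 2288 figure follows from choosing which 3 of the 13 positions differ and, at each chosen position, picking one of the 2 other symbols. A quick sketch that checks this formula against a brute-force count on a smaller case (function names are mine):

```python
from itertools import product
from math import comb

# Strings at exactly Hamming distance d from a fixed string:
# choose d positions to change, then pick one of the other
# (alphabet_size - 1) symbols at each of them.
def n_at_distance(length, d, alphabet_size=3):
    return comb(length, d) * (alphabet_size - 1) ** d

print(n_at_distance(13, 3))  # 2288

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Brute-force check on a small case: length 4, distance 2.
ref = "1111"
brute = sum(1 for s in product("123", repeat=4)
            if hamming(ref, "".join(s)) == 2)
assert brute == n_at_distance(4, 2)  # C(4,2) * 2^2 = 24
```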

My solution was to create a table t2 with all such pairs. This table has 3^13 × 2288 (about 3.65 billion) rows, where col1 matches t1.col1 and col2 holds each string at the desired distance from the string in col1. I then indexed both tables on col1, joined them on col1, summed t1.col2, and grouped by t2.col2.
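The rows of t2 for one source string can be generated like this (a sketch with names of my choosing, not code from the question):

```python
from itertools import combinations, product

ALPHABET = "123"

def neighbors_at_distance(s, d):
    """Yield all strings at exactly Hamming distance d from s."""
    for positions in combinations(range(len(s)), d):
        # At each chosen position, substitute one of the other symbols.
        choices = [[c for c in ALPHABET if c != s[p]] for p in positions]
        for subs in product(*choices):
            t = list(s)
            for p, c in zip(positions, subs):
                t[p] = c
            yield "".join(t)

s = "1" * 13
neigh = list(neighbors_at_distance(s, 3))
print(len(neigh))  # 2288 t2 rows for this one string
```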

SELECT t2.col2, SUM(t1.col2) FROM t1 JOIN t2 ON t1.col1 = t2.col1 GROUP BY t2.col2;

The problem is that this query runs for ages, and after 10 minutes I get a timeout message from the server.

My server is MySQL 5.7 Community Edition with the default configuration.

What is wrong with this solution and/or configuration that makes it take so long to process?

Best Answer

The reason your query is so slow is that MySQL has to create a temporary table to sort t2 for the grouping.

However, if you create an index on the grouping column, MySQL doesn't need a temporary table; it can use the index to read the rows in the correct order.

In this case I would suggest using the primary key for this, saving the overhead of an additional index on disk: the data will then already be stored in the correct order.

On t2, set the primary key to (col2, col1) (the order is important here). On t1, the primary key is simply (col1).
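A sketch of the table definitions under this suggestion (the column types are my assumptions; adjust them to your actual data):

```sql
-- Assumed types: the strings fit in CHAR(13), the values in INT.
CREATE TABLE t1 (
    col1 CHAR(13) NOT NULL,
    col2 INT NOT NULL,
    PRIMARY KEY (col1)
) ENGINE=InnoDB;

CREATE TABLE t2 (
    col1 CHAR(13) NOT NULL,  -- source string
    col2 CHAR(13) NOT NULL,  -- string at Hamming distance 3 from col1
    PRIMARY KEY (col2, col1) -- grouping column first: order matters
) ENGINE=InnoDB;
```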

When running your query, try adding an ORDER BY NULL clause as well. This prevents MySQL from doing the implicit sort it otherwise performs during grouping (in 5.7 and earlier).
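The query from the question with that clause added:

```sql
SELECT t2.col2, SUM(t1.col2)
FROM t1
JOIN t2 ON t1.col1 = t2.col1
GROUP BY t2.col2
ORDER BY NULL;  -- skip the implicit sort MySQL 5.7 adds to GROUP BY
```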

You will probably still hit the 10 minute timeout, depending on your hardware. You might try the SELECT ... INTO OUTFILE syntax (docs) to avoid sending the roughly 1.6 million result rows to the client; the query will then still finish on the server even if your client connection times out.
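For example (the output path is a placeholder; the file must land in a directory the server can write to, and the secure_file_priv setting must permit it):

```sql
SELECT t2.col2, SUM(t1.col2)
INTO OUTFILE '/tmp/hamming_sums.csv'
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
FROM t1
JOIN t2 ON t1.col1 = t2.col1
GROUP BY t2.col2
ORDER BY NULL;
```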

To save disk space, you should also consider adding an integer column as the primary key of t1 and referencing that key in both columns of t2, instead of storing the full 13-character string twice there as well.
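A sketch of that variant (again with assumed types): the surrogate key replaces the CHAR(13) string in t2, roughly halving its row size, while the (grouping column, join column) primary-key ordering is preserved:

```sql
CREATE TABLE t1 (
    id   INT UNSIGNED NOT NULL,
    col1 CHAR(13) NOT NULL,
    col2 INT NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY (col1)
) ENGINE=InnoDB;

CREATE TABLE t2 (
    id1 INT UNSIGNED NOT NULL,  -- references t1.id (source string)
    id2 INT UNSIGNED NOT NULL,  -- references t1.id (string at distance 3)
    PRIMARY KEY (id2, id1)
) ENGINE=InnoDB;
```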