MySQL join on hamming distance

Tags: mysql, join

I have a table t1 with two columns: col1 holds every possible string of length 13 over the alphabet {1,2,3}, and col2 holds some integer X. The table therefore has 3^13 = 1,594,323 rows in total.
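To make the setup concrete, here is a small sketch (names are mine, not from the question) that enumerates exactly the key set of t1 and confirms the row count:

```python
from itertools import product

# Every string of length 13 over the alphabet {'1','2','3'}
# is one row key of t1: 3^13 = 1,594,323 strings.
ALPHABET = "123"
LENGTH = 13

def all_strings(alphabet=ALPHABET, length=LENGTH):
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

count = sum(1 for _ in all_strings())
print(count)  # 1594323
```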

For each string I need to find all other strings at a Hamming distance of exactly, say, 3 and sum their col2 values. Each string has 2288 (C(13,3) × 2^3 = 286 × 8) other strings at Hamming distance 3.
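The 2288 figure follows from choosing which 3 of the 13 positions differ and, at each chosen position, picking one of the 2 other symbols. A quick sketch that checks this formula against a brute-force count on a smaller case (function names are mine):

```python
from itertools import product
from math import comb

# Strings at exactly Hamming distance d from a fixed string:
# choose d positions to change, then pick one of the other
# (alphabet_size - 1) symbols at each of them.
def n_at_distance(length, d, alphabet_size=3):
    return comb(length, d) * (alphabet_size - 1) ** d

print(n_at_distance(13, 3))  # 2288

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Brute-force check on a small case: length 4, distance 2.
ref = "1111"
brute = sum(1 for s in product("123", repeat=4)
            if hamming(ref, "".join(s)) == 2)
assert brute == n_at_distance(4, 2)  # C(4,2) * 2^2 = 24
```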

My solution was to create a table t2 with all such pairs. This table has 3^13 × 2288 (about 3.65 billion) rows, where col1 matches t1.col1 and col2 holds each string at the desired distance from the string in col1. I then indexed both tables on col1, joined them on col1, summed t1.col2, and grouped by t2.col2.
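The rows of t2 for one source string can be generated like this (a sketch with names of my choosing, not code from the question):

```python
from itertools import combinations, product

ALPHABET = "123"

def neighbors_at_distance(s, d):
    """Yield all strings at exactly Hamming distance d from s."""
    for positions in combinations(range(len(s)), d):
        # At each chosen position, substitute one of the other symbols.
        choices = [[c for c in ALPHABET if c != s[p]] for p in positions]
        for subs in product(*choices):
            t = list(s)
            for p, c in zip(positions, subs):
                t[p] = c
            yield "".join(t)

s = "1" * 13
neigh = list(neighbors_at_distance(s, 3))
print(len(neigh))  # 2288 t2 rows for this one string
```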

SELECT t2.col2, SUM(t1.col2) FROM t1 JOIN t2 ON t1.col1 = t2.col1 GROUP BY t2.col2;

The problem is that this query runs for ages, and after 10 minutes I get a timeout message from the server.

My server is MySQL 5.7 Community Edition with the default configuration.

What is wrong with this solution and/or configuration that makes it take so long to process?

Best Answer

The reason your query is so slow is that MySQL has to create a temporary table to sort t2 for the grouping.

However, if you create an index on the grouping column, MySQL doesn't need a temporary table; it can use the index to read the rows in the correct order.

In this case I would suggest using the primary key for this, saving the overhead of an additional index on disk: the data will then already be stored in the correct order.

On t2, set the primary key to (col2, col1) (the order is important here). On t1, the primary key is simply (col1).
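A sketch of the table definitions under this suggestion (the column types are my assumptions; adjust them to your actual data):

```sql
-- Assumed types: the strings fit in CHAR(13), the values in INT.
CREATE TABLE t1 (
    col1 CHAR(13) NOT NULL,
    col2 INT NOT NULL,
    PRIMARY KEY (col1)
) ENGINE=InnoDB;

CREATE TABLE t2 (
    col1 CHAR(13) NOT NULL,  -- source string
    col2 CHAR(13) NOT NULL,  -- string at Hamming distance 3 from col1
    PRIMARY KEY (col2, col1) -- grouping column first: order matters
) ENGINE=InnoDB;
```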

When running your query, try adding an ORDER BY NULL clause as well. This prevents MySQL from doing the implicit sort it otherwise performs during grouping (in 5.7 and earlier).
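The query from the question with that clause added:

```sql
SELECT t2.col2, SUM(t1.col2)
FROM t1
JOIN t2 ON t1.col1 = t2.col1
GROUP BY t2.col2
ORDER BY NULL;  -- skip the implicit sort MySQL 5.7 adds to GROUP BY
```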

You will probably still hit the 10 minute timeout, depending on your hardware. You might try the SELECT ... INTO OUTFILE syntax (docs) to avoid sending the roughly 1.6 million result rows to the client; the query will then still finish on the server even if your client connection times out.
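For example (the output path is a placeholder; the file must land in a directory the server can write to, and the secure_file_priv setting must permit it):

```sql
SELECT t2.col2, SUM(t1.col2)
INTO OUTFILE '/tmp/hamming_sums.csv'
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
FROM t1
JOIN t2 ON t1.col1 = t2.col1
GROUP BY t2.col2
ORDER BY NULL;
```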

To save disk space, you should also consider adding an integer column as the primary key of t1 and referencing that key in both columns of t2, instead of storing the full 13-character string twice there as well.
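A sketch of that variant (again with assumed types): the surrogate key replaces the CHAR(13) string in t2, roughly halving its row size, while the (grouping column, join column) primary-key ordering is preserved:

```sql
CREATE TABLE t1 (
    id   INT UNSIGNED NOT NULL,
    col1 CHAR(13) NOT NULL,
    col2 INT NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY (col1)
) ENGINE=InnoDB;

CREATE TABLE t2 (
    id1 INT UNSIGNED NOT NULL,  -- references t1.id (source string)
    id2 INT UNSIGNED NOT NULL,  -- references t1.id (string at distance 3)
    PRIMARY KEY (id2, id1)
) ENGINE=InnoDB;
```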