Mysql – Strategy to scale a comparison table query and load time

index-tuningMySQLscalability

given a product table

PRODUCT
id   some_column
1    10
2     2
3     8

when comparing all them I get a triangular matrix that I load from a csv file

COMPARISION
id  product_left  product_right   result
1              1              2        3
2              1              3        8  
3              2              3        2

where I want to perform a query that returns all the best n comparisons for a given product

SELECT * FROM COMPARISION WHERE ( product_left = 1 OR product_right = 1 ) ORDER BY result DESC LIMIT n

I've made an index for product_left and another for product_right column

But I'm not convinced that it will scale for tens of thousands products

ie:
20k products will produce ~200 million comparisons
the load takes 15 hours in regular desktop computer
and it will get slower to load with more indexes

so, which should be the best strategy to scale it?

  • partitioning
  • do a different query
  • index the result column to help order by clause

you can find the solution here
https://github.com/rubentrancoso/cosinsimilarity

Best Answer

This will probably run faster:

( SELECT * FROM Comparision WHERE product_left  = 1
           ORDER BY result DESC  LIMIT n
)
UNION DISTINCT
( SELECT * FROM Comparision WHERE product_right = 1
           ORDER BY result DESC  LIMIT n
)
ORDER BY result DESC  LIMIT n

If there is no change of fetching the same row from the two SELECTs, change to UNION ALL to be a little faster.

Have these two 'composite' indexes:

INDEX(product_left,  result)
INDEX(product_right, result)

Yes, the ORDER BY and LIMIT are duplicated. If you also need OFFSET, then it gets more complex, but still possible.

Why? With the indexes I gave you, each SELECT will optimally filter (WHERE) and sort (ORDER BY), thereby touch only n rows. Then the UNION puts together the 2*n rows, re-sorts and delivers the desired n.

With just INDEX(product_left) it will gather all the 1 rows, sort them and peel off n -- slower.

Without the UNION, the query will simply scan the entire table, ignoring your indexes (or any others). (OK, there is a chance of some index being usable, but what I provided is better.) Use EXPLAIN SELECT ... to see what is going on. (Ask for help it it is not obvious.)