Mysql – Strategy to scale a comparison table query and load time

index-tuningMySQLscalability

given a product table

PRODUCT
id   some_column
1    10
2     2
3     8

when comparing all them I get a triangular matrix that I load from a csv file

COMPARISION
id  product_left  product_right   result
1              1              2        3
2              1              3        8  
3              2              3        2

where I want to perform a query that returns all the best n comparisons for a given product

SELECT * FROM COMPARISION WHERE ( product_left = 1 OR product_right = 1 ) ORDER BY result DESC LIMIT n

I've made an index for product_left and another for product_right column

But I'm not convinced that it will scale for tens of thousands products

ie:
20k products will produce ~200 million comparisons
the load takes 15 hours in regular desktop computer
and it will get slower to load with more indexes

so, which should be the best strategy to scale it?

partitioning
do a different query
index the result column to help order by clause
…

you can find the solution here
https://github.com/rubentrancoso/cosinsimilarity

Best Answer

This will probably run faster:

( SELECT * FROM Comparision WHERE product_left  = 1
           ORDER BY result DESC  LIMIT n
)
UNION DISTINCT
( SELECT * FROM Comparision WHERE product_right = 1
           ORDER BY result DESC  LIMIT n
)
ORDER BY result DESC  LIMIT n

If there is no change of fetching the same row from the two SELECTs, change to UNION ALL to be a little faster.

Have these two 'composite' indexes:

INDEX(product_left,  result)
INDEX(product_right, result)

Yes, the ORDER BY and LIMIT are duplicated. If you also need OFFSET, then it gets more complex, but still possible.

Why? With the indexes I gave you, each SELECT will optimally filter (WHERE) and sort (ORDER BY), thereby touch only n rows. Then the UNION puts together the 2*n rows, re-sorts and delivers the desired n.

With just INDEX(product_left) it will gather all the 1 rows, sort them and peel off n -- slower.

Without the UNION, the query will simply scan the entire table, ignoring your indexes (or any others). (OK, there is a chance of some index being usable, but what I provided is better.) Use EXPLAIN SELECT ... to see what is going on. (Ask for help it it is not obvious.)

Related Solutions

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

The way this is designed you only have suboptimal choices. Random GUIDs are not well suited as clustered index keys, since they are neither small (which affects the size of all secondary indexes) nor sequential (unless you can use NEWSEQUENTIALID()) which leads to index fragmentation, which leads to wasted space, slower insert performance and slower query performance through more I/O.

On the other hand, if your normalized tables are linked by such a GUID then each join depends on them and you will have to bite the bullet and use them as primary keys with clustered index anyway. Just create the PRIMARY KEY constraint and the clustered index in separate steps so you can define PAD_INDEX = ON and FILLFACTOR=50 to slow down the fragmentation somewhat. Still, expect to do regular, expensive index REBUILDs to reduce the inevitable fragmentation.

Your secondary indexes must not start with the id, because that renders them useless! Imagine a telephone book, where each entry is given a random or running id, then the phone book is sorted by that id plus the name. Have fun searching a given name in that. A useable index must start with the column that is used in the where- or join clause.

So, with the clustered indexes created so far you cover queries of the type

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id

The query runs through the products, can efficiently look up the store ids and has immediate access to the store name, since the store id index is clustered.

Now you want to do this:

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
WHERE p.productname LIKE 'A%'

For this you need a nonclustered index with just productname as the key column (and optionally storeid as included column, if you do frequent range searches on productname).

OK, what about the reverse case?

SELECT p.productname, s.name as StoreName 
FROM Store s
INNER JOIN Products p ON p.storeid = s.id
WHERE s.name = 'My little cornershop'

For this, you need two additional indexes: One nonclustered on store with the name column and one nonclustered on products with storeid as the column. SQL Server can efficiently find the store record (expecting only one record), then through the second index can find all product entries for this store (still only a few compared to all entries in product), then for each of these products go through the clustered index (the clustered index key is automatically part of each nonclustered index) to get to the productname column.

I hope you see the pattern here. Create a nonclustered index for each column that gets queried with a high selectivity (meaning that only a small subset of all the rows will be selected).

The row-columns are completely useless in this scenario, just drop them to save space.

Using client generated GUIDs is attractive from the client point of view. You can create coherent datasets (such as a new customer including his first order) and push them to the database without caring for the correct INSERT order and without having to read database generated ids afterwards to update your object model. But you pay a nontrivial performance price for this when it comes to getting the data back from the database, as I hopefully made clear above. The large primary key (8 bytes) gets added to each nonclustered index and blows up its size, and you get a heavily fragmented clustered index which is never good.

Using IDENTITY values for primary keys has disadvantages at INSERT time, but pays off every time after that.

PostgreSQL Index Tuning – Compound Index on Large Tables

In a GiST index, the order of columns has a different significance than in a B-tree index. Per documentation:

A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.

In short: put the most selective columns first.

Your EXPLAIN output shows that the condition on pid is more selective (rows=7836) than the one on outline (rows=63112). If that can be generalized (a single example may be misleading) I suggest this alternative:

CREATE INDEX inventory_compound_idx ON portal.inventory USING gist (pid, outline);

If most of your (important) queries include conditions on both columns, a multicolumn index may serve you well. Else, single columns may be better overall.

Table layout

This is an educated guess since I don't have complete information.

Don't use oid as column name. It's easy to confuse with the OID.
Don't use the name date for a timestamp column. Or rather: don't use the name date for any column, don't use names of base-types for identifiers at all. Can lead to confusing mistakes and error messages.
Create a lookup table for types and only put a small integer type_id into the big table. Pack fixed-length types tightly so not to waste space to padding. Details.
I prefer the type text (or varchar without length limit) over varchar(n). Details.

For example:

CREATE TABLE portal.inventory (
   inventory_id bigint PRIMARY KEY
  ,type_id      integer NOT NULL REFERENCES inv_type(type_id)
  ,pid          integer NOT NULL
  ,size         bigint NOT NULL
  ,ts           timestamp NOT NULL
  ,outline      geography(Polygon,4326)
  ,product_name text
  ,path         text
);

Best Answer

Related Solutions

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

PostgreSQL Index Tuning – Compound Index on Large Tables

Table layout

Related Question