MySQL Performance – Issues Extracting Scatter Plot Data

Tags: join, mysql, performance, query-performance

First off: Sorry for the title, couldn't think of anything better. Feel free to change it though.

Ok, assume I have the following table:

 -----------------------------------------------------------------------------------
| id | category | value | foreign_id_1 | foreign_id_2 | foreign_id_3 | foreign_id_4 |
 -----------------------------------------------------------------------------------
|  1 | A        |     4 | 1            | 1            | 1            | 1            |
|  2 | B        |     9 | 1            | 1            | 1            | 1            |
|  3 | A        |     5 | 2            | 2            | 2            | 2            |
|  4 | A        |     6 | 3            | 3            | 3            | 3            |
|  5 | B        |     8 | 3            | 3            | 3            | 3            |
|    |          |       |              |      ...     |              |              |
 -----------------------------------------------------------------------------------

Now I want to create a scatter plot for the categories (A as x, B as y based on the value column).
The desired output would be:

 ----------------------
| foreign_id_1 | x | y |
 ----------------------
| 1            | 4 | 9 |
| 3            | 6 | 8 |
 ----------------------

Row 3 from the original table isn't included, because it only has a value for A and none for B.

What I've tried so far is the following:

SELECT A.foreign_id_1, A.value AS x, B.value AS y
FROM
  (
    SELECT value, foreign_id_1, foreign_id_2, foreign_id_3, foreign_id_4
    FROM table
    WHERE category = 'A'
  ) AS A
JOIN
  (
    SELECT value, foreign_id_1, foreign_id_2, foreign_id_3, foreign_id_4
    FROM table
    WHERE category = 'B'
  ) AS B
ON
  (
    A.foreign_id_1 = B.foreign_id_1
    AND A.foreign_id_2 = B.foreign_id_2
    AND A.foreign_id_3 = B.foreign_id_3
    AND A.foreign_id_4 = B.foreign_id_4
  )

That all works fine, but it's really slow (30 s with my current data set, which might grow in the future). The problem seems to be that identifying the "matching rows" (the ones that form the x and y of a data point) is expensive, because all four foreign keys have to match.

Now to the question:

Assuming that I cannot change the table structure, is there a way to make this query faster? Or maybe there is a different way to get the desired result that I couldn't think of?

Best Answer

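There is no reason to use derived tables for this query. Older MySQL versions materialize each derived table into a temporary table without indexes, so the join between them cannot use any index, and that can hurt performance badly. Try this query: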

SELECT 
    a.foreign_id_1, a.value AS x, b.value AS y
FROM
    tablename AS a      -- "table" is a reserved word in MySQL, so use the real table name
  JOIN
    tablename AS b
      ON  a.foreign_id_1 = b.foreign_id_1
      AND a.foreign_id_2 = b.foreign_id_2
      AND a.foreign_id_3 = b.foreign_id_3
      AND a.foreign_id_4 = b.foreign_id_4
WHERE
      a.category = 'A'
  AND b.category = 'B' ;
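
To check that the self-join returns the rows the question asks for, here is a minimal sketch that runs it against the five sample rows. It uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, and the table name `measurements` is a made-up placeholder. It only checks the result, it says nothing about MySQL timings.

```python
import sqlite3

# Sample rows copied from the question: (id, category, value, foreign_id_1..4).
ROWS = [
    (1, "A", 4, 1, 1, 1, 1),
    (2, "B", 9, 1, 1, 1, 1),
    (3, "A", 5, 2, 2, 2, 2),
    (4, "A", 6, 3, 3, 3, 3),
    (5, "B", 8, 3, 3, 3, 3),
]

# Same self-join as above; "measurements" is a hypothetical table name.
SELF_JOIN = """
SELECT a.foreign_id_1, a.value AS x, b.value AS y
FROM measurements AS a
JOIN measurements AS b
  ON  a.foreign_id_1 = b.foreign_id_1
  AND a.foreign_id_2 = b.foreign_id_2
  AND a.foreign_id_3 = b.foreign_id_3
  AND a.foreign_id_4 = b.foreign_id_4
WHERE a.category = 'A'
  AND b.category = 'B'
ORDER BY a.foreign_id_1
"""


def scatter_points():
    """Load the sample rows into an in-memory SQLite table and run the join."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE measurements (
               id INTEGER PRIMARY KEY, category TEXT, value INTEGER,
               foreign_id_1 INTEGER, foreign_id_2 INTEGER,
               foreign_id_3 INTEGER, foreign_id_4 INTEGER)"""
    )
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)
    points = conn.execute(SELF_JOIN).fetchall()
    conn.close()
    return points


print(scatter_points())  # [(1, 4, 9), (3, 6, 8)]
```

Row 3 drops out because an inner join only keeps the A rows that have a matching B row.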

If the efficiency of the query is crucial, you can add a covering index whose columns are ordered specifically for this query:

ALTER TABLE tablename
  ADD INDEX category_foreign_ids_value          -- pick a name for the index
                                                              -- columns from:
    (category,                                                -- the WHERE 
     foreign_id_1, foreign_id_2, foreign_id_3, foreign_id_4,  -- the JOIN ... ON
     value) ;                                                 -- the SELECT
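
To see whether an index like this really covers the query, you can look at the query plan. The sketch below does that with SQLite (through Python's sqlite3 module) as a stand-in, so the planner and its output format are not MySQL's, and the table name `measurements` is again a made-up placeholder. In MySQL itself you would run EXPLAIN on the query and look for "Using index" in the Extra column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE measurements (
           id INTEGER PRIMARY KEY, category TEXT, value INTEGER,
           foreign_id_1 INTEGER, foreign_id_2 INTEGER,
           foreign_id_3 INTEGER, foreign_id_4 INTEGER)"""
)

# Same column order as the ALTER TABLE above:
# WHERE column first, then the JOIN columns, then the selected value.
conn.execute(
    """CREATE INDEX category_foreign_ids_value ON measurements
           (category, foreign_id_1, foreign_id_2, foreign_id_3, foreign_id_4, value)"""
)

SELF_JOIN = """
SELECT a.foreign_id_1, a.value AS x, b.value AS y
FROM measurements AS a
JOIN measurements AS b
  ON  a.foreign_id_1 = b.foreign_id_1
  AND a.foreign_id_2 = b.foreign_id_2
  AND a.foreign_id_3 = b.foreign_id_3
  AND a.foreign_id_4 = b.foreign_id_4
WHERE a.category = 'A'
  AND b.category = 'B'
"""

# Each plan row is (id, parent, notused, detail); the detail text names the index used.
plan_details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + SELF_JOIN)]
for detail in plan_details:
    print(detail)
```

If the plan lines mention a covering index, every column the query needs is read from the index and the table rows are never touched.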

Related Solutions

Store categories in database or use hard coded integers

Performance is an excellent thing to consider, but I find it helpful to worry about other costs like data quality. Referential integrity ensures that no values can be assigned that don't have a match in the list of values table. Everyone knows 1 is mailing, 2 is billing, piece of pie. FNG comes on board and makes a code change and oops, we now have addresses with an address type of 12. And we've had that problem for 2 months now, customers are angry that they have not received their product and they're filing complaints with the BBB. RI would have caused the app to blow up when they tried to push those bad values into the database.

"Nobody's going to make that mistake here." Fine, the app is a success and brings in lots of money. Eventually someone's going to want to know more about what's going on. We want reports and yes, we also said we need all these new features done by last week. So now, everywhere you've embedded the knowledge that 1 is billing and 2 is mailing, you need to repeat it in the reporting tool (Excel, traditional reporting systems, ad-hoc queries, etc.) you give to the end users to make them happy. Now you've got that logic in two places, so any change to that logic has to be propagated across two systems.

Each time you carry logic like that across systems, you add to your technical debt. Maybe your system is never going to grow like that and it's fine to embed the logic in the app code, but I've found the quickest way to achieve immortality is to implement a quick fix.

Oh and did you notice I switched up the values in the report example? The FNG who you brought in could make a rookie mistake like that and who knows how long it's going to take to catch the mistake.
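
The referential integrity point above can be sketched in a few lines. This is only an illustration under assumptions: it uses SQLite through Python's sqlite3 module rather than MySQL, and the `address_type` and `address` tables and their columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Hypothetical lookup table holding the list of valid address types.
conn.execute("CREATE TABLE address_type (id INTEGER PRIMARY KEY, label TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE address (
           id INTEGER PRIMARY KEY,
           street TEXT NOT NULL,
           address_type_id INTEGER NOT NULL REFERENCES address_type (id))"""
)
conn.executemany(
    "INSERT INTO address_type VALUES (?, ?)", [(1, "mailing"), (2, "billing")]
)


def try_insert(street, type_id):
    """Return True if the row was accepted, False if the foreign key rejected it."""
    try:
        conn.execute(
            "INSERT INTO address (street, address_type_id) VALUES (?, ?)",
            (street, type_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_ok = try_insert("1 Main St", 2)   # 2 exists in address_type
bogus_ok = try_insert("2 Side St", 12)  # 12 does not, so the insert fails
print(valid_ok, bogus_ok)  # True False
```

The bad value fails loudly at insert time instead of sitting in the table for two months.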

MySQL – How to sort category by path and sort order

The problem is you need to extract the sort order for the root items and make that same order apply to the sub-items in the tree without re-ordering the sub-items. In Oracle you can do this with a windowing function as follows:

SELECT RPAD('- ',Length(Path)-1,'- ') || label, levelx, path, sort_order
   , min(sort_order) OVER (PARTITION BY FirstLevel) Sort_Order2
FROM (
   SELECT id, parent_id, label, levelx, path, sort_order
      , substr(path,2,1) FirstLevel 
   FROM t1 WHERE Label IS NOT NULL
)
ORDER BY Sort_Order2, Path;

I don't know if you can do something like that in MySQL, so here is a version using a GROUP BY and a self join that works in MySQL 5.5.28.

SELECT Label
FROM T1 a
JOIN
   (SELECT SUBSTR(Path,2,1) FirstLevel, MIN(Sort_Order) FirstSort FROM T1 
      WHERE Label IS NOT NULL GROUP BY SUBSTR(Path,2,1)) b
   ON substr(a.Path,2,1) = b.FirstLevel
ORDER BY FirstSort, Path;

(SQL Fiddle)
