Mysql – How to optimize performance on a MySQL ndbcluster

MySQLoptimizationperformance

So, this is what I have:

A simple table classes__to_students:

|class   |student|
---------------
 Math     Alice
 Math     Bob
 Math     Peter
 Math     Anne
 Music    Bob
 Music    Chis
 Music    Debbie
 ...

There's 1000000 classes. Every student attends 500 classes.
(Bad analogy, I know…)
For testing, there's 4000 students in the DB (so there is 2.000.000 rows), but this DB should handle several million students, that's why I'm using a MySQL ndbcluster.

Anyway… a query like this:

SELECT student, COUNT(class) as common_classes
FROM classes_table
WHERE class IN (my_subject_list)
GROUP BY student
ORDER BY common_classes DESC

resulting in something like:

|student |commonClasses|
 Brad     6
 Melissa  4
 Chris    3
 Bob      3
 ...

takes about 1 second with a InnoDB engine on one server, which is ok. On a ndbcluster with 3 datanodes the same query takes up to 10 seconds, which is far too much. I don't know how the above statement is treated internally, but I guess that there is a lot of communication between the nodes, which makes it slow due to latency.
Can someone tell me, what happens in the cluster when I perform this query? How can I make it faster?

Note: this is a question that came up after I posted this question:
https://stackoverflow.com/questions/9095054/how-do-i-compute-a-ranking-with-mysql-stored-procedures
For more informations, have a look there!

Best Answer

First of all, Cluster is a storage engine. It doesn't actually execute queries because it doesn't speak SQL. That is why you use a MySQL server, which parses your queries and sends low-level storage engine API calls to the Cluster data nodes. The data nodes know how to retrieve or store data. Or you can talk to the data nodes directly using the NDB API(s).

MySQL Cluster has various means of executing queries. They boil down to:

Primary key lookup
Unique key lookup
Ordered index scan (i.e., non-unique indexes that use T-trees)
Full table scan

Let's say you have 4 data nodes in your cluster (NoOfReplicas=2). This means you have 2 node groups and each one has half the data. Cluster uses a hash on the primary key (unless you've controlled the partitioning using the 5.1 partitioning features). So for any table, half the rows are in one node group and half the rows are the in other node group.

Now for the 4 types of query execution. You can verify which type of execution is used with EXPLAIN. Here's how each ones works:

Primary key lookup - the MySQL server can calculate the hash on the primary key and know which data node the relevant row resides in. Then the MySQL server contacts that data node and receives the row.
Unique key lookup - the MySQL server cannot know which data node the row might be stored in. So it contacts a pseudo-random node. That data node has a hidden table that maps the unique key values to the primary key. Then the hash on the primary key reveals where the row resides.
Ordered index scans are sent in parallel to all data nodes, where they search their local t-tree.
Full table scans are send in parallel to all data nodes, where they scan their primary fragment.

Summary: primary key lookups are best. If you have more than 2 nodes, throughput goes up because all nodes are actively serving different fragments of data. Ordered index lookup and full table scans are done in parallel, so more nodes leads to better performance.

Related Solutions

Mysql – How to optimize this MySQL query further

After looking over the query, the tables, and the WHERE AND GROUP BY clauses, I recommend the following:

Recommendation #1) Refactor the Query

I reorganized the query to do three(3) things:

create smaller temp tables
Process the WHERE clause on those temp tables
Delay joining to the very last

Here is my proposed query:

SELECT
  sounds.*,srkeys.avg_rating,srkeys.votes
FROM
(
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
) srkeys INNER JOIN sounds USING (id);

Recommendation #2) Index the sounds table with an index that will accommodate the WHERE clause

The columns of this index include all the columns from the WHERE clause with static values first and moving target last

ALTER TABLE sounds ADD INDEX support_index
(blacklisted,ready_for_deployment,deployed,type,created_at);

I sincerely believe you will be pleasantly surprised. Give it a Try !!!

UPDATE 2011-05-21 19:04

I just saw the cardinality. OUCH !!! Cardinality of 1 for rateable_id. Boy, I feel stupid !!!

UPDATE 2011-05-21 19:20

Maybe making the index will be enough to improve things.

UPDATE 2011-05-21 22:56

Please run this:

EXPLAIN SELECT
  sounds.*,srkeys.avg_rating,srkeys.votes
FROM
(
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes FROM
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
) srkeys INNER JOIN sounds USING (id);

UPDATE 2011-05-21 23:34

I refactored it again. Try This One Please:

EXPLAIN
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes FROM
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
;

UPDATE 2011-05-21 23:55

I refactored it again. Try This One Please (Last Time):

EXPLAIN
  SELECT A.id,avg(B.rating) AS avg_rating, count(B.rating) AS votes FROM
  (
    SELECT BB.* FROM
    (
      SELECT id FROM sounds
      WHERE blacklisted = false 
      AND   ready_for_deployment = true 
      AND   deployed = true 
      AND   type = "Sound" 
      AND   created_at > '2011-03-26 21:25:49'
    ) AA INNER JOIN sounds BB USING (id)
  ) A INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) B
  ON A.id = B.rateable_id
  GROUP BY B.rateable_id;

UPDATE 2011-05-22 00:12

I hate giving up !!!!

EXPLAIN
  SELECT A.*,avg(B.rating) AS avg_rating, count(B.rating) AS votes FROM
  (
    SELECT BB.* FROM
    (
      SELECT id FROM sounds
      WHERE blacklisted = false 
      AND   ready_for_deployment = true 
      AND   deployed = true 
      AND   type = "Sound" 
      AND   created_at > '2011-03-26 21:25:49'
    ) AA INNER JOIN sounds BB USING (id)
  ) A,
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
    AND AAA.rateable_id = A.id
  ) B
  GROUP BY B.rateable_id;

UPDATE 2011-05-22 07:51

It has been bothering me that ratings is coming back with 2 million rows in the EXPLAIN. Then, it hit me. You might need another index on the ratings table which starts with rateable_type:

ALTER TABLE ratings ADD INDEX
rateable_type_rateable_id_ndx (rateable_type,rateable_id);

The goal of this index is to reduce the temp table that manipulates ratings so that it is less that 2 million. If we can get that temp table significantly smaller (at least half), then we can have a better hope in your query and mine working faster too.

After making that index, please Retry my original proposed query and also try yours:

SELECT
  sounds.*,srkeys.avg_rating,srkeys.votes
FROM
(
  SELECT AA.id,avg(BB.rating) AS avg_rating, count(BB.rating) AS votes
  (
    SELECT id FROM sounds
    WHERE blacklisted = false 
    AND   ready_for_deployment = true 
    AND   deployed = true 
    AND   type = "Sound" 
    AND   created_at > '2011-03-26 21:25:49'
  ) AA INNER JOIN
  (
    SELECT AAA.ratings,AAA.rateable_id
    FROM ratings AAA
    WHERE rateable_type = 'Sound'
  ) BB
  ON AA.id = BB.rateable_id
  GROUP BY BB.rateable_id
) srkeys INNER JOIN sounds USING (id);

UPDATE 2011-05-22 18:39 : FINAL WORDS

I had refactored a query in a stored procedure and added an index to help answer a question on speeding things up. I got 6 upvotes, had the answer accepted,and picked up a 200 bounty.

I had also refactored another query (marginal results) and added an index (dramatic results). I got 2 upvotes and had the answer accepted.

I added an index for yet another query challange and was upvoted once

and now your question.

Wanting to answers all questions like these (including yours) were inspired by a YouTube video I watched on refactoring queries.

Thank you again, @coneybeare !!! I wanted to answer this question to fullest extent possible, not just accept points or accolades. Now, I can feel that I earned the points !!!

Mysql – Trying to optimize query for performance (MySQL)

Your original query joins everything together then traverses 50000 rows into the joined data before presenting the next 100 rows. Try this

Get the 100 keys from ProjectManagement2 upfront
join everything afterwards

Here is my proposed query

SELECT i.id, i.`key`, i.title, i.description,
CONCAT(ru.firstName, ' ', ru.lastName) as `name`,
CONCAT(au.firstName, ' ', au.lastName) as `name`,
p.title, pc.title, pva.title, pvo.title, pvf.title, i.durationEstimate,
i.storyPoints, i.dueDate, isl.title, i.rejectionCount,
CONCAT(uc.firstName, ' ', uc.lastName) as `name`,
i.createdTimestamp, i.updatedTimestamp, it.title, isss.title
FROM
(
    SELECT B.* FROM
    (SELECT id FROM ProjectManagement2 LIMIT 50000, 100) A
    INNER JOIN ProjectManagement2 B USING (id)
) i
INNER JOIN Users ru ON ru.id = i.reporterUserUsername
INNER JOIN Users au ON au.id = i.assignedUserUsername
INNER JOIN Projects p ON p.id = i.projectTitle
INNER JOIN ProjectComponents pc ON pc.id = i.projectComponentTitle
INNER JOIN ProjectVersions pva ON pva.id = i.affectedProjectVersionTitle
INNER JOIN ProjectVersions pvo ON pvo.id = i.originalFixedProjectVersionTitle
INNER JOIN ProjectVersions pvf ON pvf.id = i.fixedProjectVersionTitle
INNER JOIN IssueSecurityLevels isl ON isl.id = i.issueSecurityLevelId
INNER JOIN IssueTypes it ON it.id = i.issueTypeId
INNER JOIN IssueStatuses isss ON isss.id = i.issueStatusId
INNER JOIN Users uc ON uc.id = i.creatorUserUsername
;

Give it a Try !!!

Best Answer

Related Solutions

Mysql – How to optimize this MySQL query further

Mysql – Trying to optimize query for performance (MySQL)

Related Question