Mysql – How to optimize performance on a MySQL ndbcluster

MySQLoptimizationperformance

So, this is what I have:

A simple table classes__to_students:

|class   |student|
---------------
 Math     Alice
 Math     Bob
 Math     Peter
 Math     Anne
 Music    Bob
 Music    Chis
 Music    Debbie
 ...

There's 1000000 classes. Every student attends 500 classes.
(Bad analogy, I know…)
For testing, there's 4000 students in the DB (so there is 2.000.000 rows), but this DB should handle several million students, that's why I'm using a MySQL ndbcluster.

Anyway… a query like this:

SELECT student, COUNT(class) as common_classes
FROM classes_table
WHERE class IN (my_subject_list)
GROUP BY student
ORDER BY common_classes DESC

resulting in something like:

|student |commonClasses|
 Brad     6
 Melissa  4
 Chris    3
 Bob      3
 ...

takes about 1 second with a InnoDB engine on one server, which is ok. On a ndbcluster with 3 datanodes the same query takes up to 10 seconds, which is far too much. I don't know how the above statement is treated internally, but I guess that there is a lot of communication between the nodes, which makes it slow due to latency.
Can someone tell me, what happens in the cluster when I perform this query? How can I make it faster?

Note: this is a question that came up after I posted this question:
https://stackoverflow.com/questions/9095054/how-do-i-compute-a-ranking-with-mysql-stored-procedures
For more informations, have a look there!

Best Answer

First of all, Cluster is a storage engine. It doesn't actually execute queries because it doesn't speak SQL. That is why you use a MySQL server, which parses your queries and sends low-level storage engine API calls to the Cluster data nodes. The data nodes know how to retrieve or store data. Or you can talk to the data nodes directly using the NDB API(s).

MySQL Cluster has various means of executing queries. They boil down to:

  • Primary key lookup

  • Unique key lookup

  • Ordered index scan (i.e., non-unique indexes that use T-trees)

  • Full table scan

Let's say you have 4 data nodes in your cluster (NoOfReplicas=2). This means you have 2 node groups and each one has half the data. Cluster uses a hash on the primary key (unless you've controlled the partitioning using the 5.1 partitioning features). So for any table, half the rows are in one node group and half the rows are the in other node group.

Now for the 4 types of query execution. You can verify which type of execution is used with EXPLAIN. Here's how each ones works:

Primary key lookup - the MySQL server can calculate the hash on the primary key and know which data node the relevant row resides in. Then the MySQL server contacts that data node and receives the row.
Unique key lookup - the MySQL server cannot know which data node the row might be stored in. So it contacts a pseudo-random node. That data node has a hidden table that maps the unique key values to the primary key. Then the hash on the primary key reveals where the row resides.
Ordered index scans are sent in parallel to all data nodes, where they search their local t-tree.
Full table scans are send in parallel to all data nodes, where they scan their primary fragment.

Summary: primary key lookups are best. If you have more than 2 nodes, throughput goes up because all nodes are actively serving different fragments of data. Ordered index lookup and full table scans are done in parallel, so more nodes leads to better performance.