Mysql – ORDER BY on aggregate field on a closure table giving unexpected results

aggregatedatabase-designMySQLsorting

I am updating my companies product database and I'm looking into different ways to do design our faceted navigation "selector" tables. The idea of faceted navigation might look like this:

enter image description here

It's hierarchical data. I have been reading through Bill Karwin's slides on different design patterns for hierarchical data.

I'm trying out the "Closure Table" pattern. I have these two tables that look like this:

CREATE TABLE Selector (
  SelectorID  INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
  Name VARCHAR(32),
  SortOrder INT
);

CREATE TABLE SelectorPath (
  AncestorID INT NOT NULL,  
  DescendantID INT NOT NULL, 
  PathLength INT NOT NULL, 
  PRIMARY KEY(AncestorID, DescendantID),
  FOREIGN KEY (AncestorID) REFERENCES Selector(SelectorID),
  FOREIGN KEY (DescendantId) REFERENCES Selector(SelectorID)
);

I have been testing this concept on SQLFiddle.com here: http://sqlfiddle.com/#!2/881b8f/8

An ERD for SQLFiddle is:
enter image description here

The problem I'm running into is the ability to sort siblings. I'm not sure the ERD file will show up very big but the idea is that at a given node, there might be an arbitrary sorting pattern implemented by the marketing team. They may want "white" to be the last color on the list or "Bamboo" to be the first item in the material column.

In the ERD, the yellow-ish nodes (at depth of 2 where "root" is at depth of 0) would be the category names such as "material" or "color" and the leaf nodes (green on ERD) would be the values in those categories.

The second ERD picture shows the leaf nodes in red that have been properly sorted based on the sort value given for that node.

The way I'm trying to do this sorting is by using GROUP_CONCAT on the sort value like so:

  GROUP_CONCAT(s.SortOrder separator '') AS SortPath

This creates a table with a path string of integers. Below is the original inserted table structure as well as the desired sorted order.

Original Table Structure
Properly Sorted Table

My first attempt was to simply use ORDER BY SortPath. This gives results I get are simply bizarre to me. The ordering of the aggregate field ends up changing the actual SortPath value. Here are bizarre results next to the original table:

Original Table Structure
Unexpected ORDER BY result

According to this SO answer, it would seem like this should work.

Instead, I have to use a derived table and sort on the derived table to get the results I expect.

My ultimate question is, "Why is my SortPath field being changed when using ORDER BY?"

I speculate that ORDER BY actually happens before the aggregation, but I'm not finding much and I'm not sure if there might be something else in my query that's throwing things off.

Any additional comments on this design idea of sorting a Closure Table would be appreciated as well. The current database structure is an Adjacency List.

All three SQL statements (table, sorted table, derived sorted table) are in my SQL fiddle.

My biggest concern is that of performance of doing a derived table. It appears to take 8ms to do a derived table while only 4ms to sort the table.

I will be long term caching DB results at the application level, so in some ways the point is moot. However, I'm looking to learn and experience what's going on here.

Thanks!

Best Answer

Your sortpath isn't technically being "changed" by the ORDER BY. The correct values are all there, just not coming up in the order you anticipated. In SQL, if you don't explicitly sort a set of data, the order in which it comes back to you is undefined... and the fact that it originally appeared to be in the anticipated order was more along the lines of coincidental.

I tweaked your fiddle (the "sorted" query):

GROUP_CONCAT(s.SortOrder ORDER BY a.PathLength DESC separator '') AS SortPath

Now, it looks like the results are being explicitly sorted in the order you expect, which is the order in which they were implicitly sorted before.

So why did that order change? If you do an EXPLAIN SELECT on the "sorted" query with and without the ORDER BY clause, you'll notice that the ordering of the tables changes in the output of EXPLAIN depending on whether you specify ORDER BY on the entire query. This is because the query optimizer re-orders the join-order of the tables to what it thinks is the most cost-effective sequence.

With the ORDER BY, the tables are joined from left to right, n-d-a-s. Without the ORDER BY, the tables are joined from left to right, d-n-a-s. Both are logically valid, but this causes the rows to be read from the tables in a different order, and your unsorted GROUP CONCAT's output order changes as a result of the change to the query plan.