MySQL – How to Write Queries for Hierarchical Data

hierarchyjoin;MySQL

This table is a questions and answers CMS and i want to know how data is fetched from these tables.

+------+----------+-------+
|  id  | parentid | type  |
+------+----------+-------+
|  1   |     0    |   Q   |
+------+----------+-------+
|  2   |     1    |   A   |
+------+----------+-------+
|  3   |     0    |   Q   |
+------+----------+-------+
|  4   |     2    |   C   |
+------+----------+-------+
|  5   |     2    |   C   |
+------+----------+-------+
|  6   |     1    |   A   |
+------+----------+-------+
|  7   |     3    |   C   |
+------+----------+-------+
|  8   |     1    |   C   |
+------+----------+-------+
|  9   |     0    |   Q   |
+------+----------+-------+
|  10  |     9    |   C   |
+------+----------+-------+
|  11  |     9    |   A   |
+------+----------+-------+
|  12  |    11    |   C   |
+------+----------+-------+

For example, the results like below.

+------+----------+-------+
|  id  | parentid | type  |
+------+----------+-------+
|  1   |     0    |   Q   |
+------+----------+-------+
|  8   |     1    |   C   |
+------+----------+-------+
|  2   |     1    |   A   |
+------+----------+-------+
|  4   |     2    |   C   |
+------+----------+-------+
|  5   |     2    |   C   |
+------+----------+-------+
|  6   |     1    |   A   |
+------+----------+-------+

This query are not results sorted as above.and in many tutorials advised that the JOIN is better than it used to be for this table.

SELECT * FROM qa_posts WHERE postid=1
UNION
SELECT * FROM qa_posts WHERE parentid=1
UNION
SELECT * FROM qa_posts WHERE parentid IN (SELECT postid parentid FROM qa_posts WHERE parentid=1)
ORDER BY parentid;

Best Answer

Your structure i commonly known as an Adjacency List (should perhaps be Set) Model. If you don't want to change the structure of your model, you basically have to do something similar to what you are currently doing. You can rewrite your query using joins like:

SELECT q1.* FROM qa_posts q1
WHERE q1.postid=1
UNION
SELECT q2.* FROM qa_posts q1
            JOIN qa_posts q2
                on q1.postid=q2.parentid
WHERE q1.postid=1
UNION
SELECT q3.* FROM qa_posts q1
            JOIN qa_posts q2
                on q1.postid=q2.parentid
            JOIN qa_posts q3
                on q2.postid=q3.parentid
WHERE q1.postid=1;

If you are willing to change the model you can search for Nested Set (example http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/), Materialized Path (example http://bojanz.wordpress.com/2014/04/25/storing-hierarchical-data-materialized-path/) and Transitive Closure Table (at the end of http://www.slideshare.net/billkarwin/models-for-hierarchical-data, at the beginning the other two models are described)

Edit: Adding DFS with different weights according to type:

select postid, parentid, type 
from (
    SELECT q1.*     
        , lpad(q1.parentid,8,case q1.type when 'Q' then 'a' 
                                          when 'C' then 'b' 
                                          when 'A' then 'c' 
                             end)         
        || lpad(q1.postid,8,case q1.type when 'Q' then 'a' 
                                         when 'C' then 'b' 
                                         when 'A' then 'c' 
                             end) as padded_path 
    FROM qa_posts q1 
    WHERE q1.postid=1 
    UNION 
    SELECT q2.*     
       , lpad(q1.parentid,8,case q1.type when 'Q' then 'a' 
                                         when 'C' then 'b' 
                                         when 'A' then 'c' 
                            end)         
        || lpad(q1.postid,8,case q1.type when 'Q' then 'a' 
                                         when 'C' then 'b' 
                                         when 'A' then 'c' 
                            end)         
        || lpad(q2.postid,8,case q2.type when 'Q' then 'a' 
                                         when 'C' then 'b' 
                                         when 'A' then 'c' 
                            end) as padded_path 
    FROM qa_posts q1 
    JOIN qa_posts q2     
        on q1.postid=q2.parentid 
    WHERE q1.postid=1 
    UNION 
    SELECT q3.*     
         , lpad(q1.parentid,8,case q1.type when 'Q' then 'a' 
                                           when 'C' then 'b' 
                                           when 'A' then 'c' 
                              end)         
         || lpad(q1.postid,8,case q1.type when 'Q' then 'a' 
                                          when 'C' then 'b' 
                                          when 'A' then 'c' 
                              end)         
         || lpad(q2.postid,8,case q2.type when 'Q' then 'a' 
                                          when 'C' then 'b' 
                                          when 'A' then 'c' 
                              end)         
         || lpad(q3.postid,8,case q3.type when 'Q' then 'a' 
                                          when 'C' then 'b' 
                                          when 'A' then 'c' 
                              end) as padded_path 
    FROM qa_posts q1 
    JOIN qa_posts q2     
        on q1.postid=q2.parentid 
    JOIN qa_posts q3     
        on q2.postid=q3.parentid 
    WHERE q1.postid=1
) as x 
order by padded_path;

+--------+----------+------+
| postid | parentid | type |
+--------+----------+------+
|      1 |        0 | Q    |
|      8 |        1 | C    |
|      2 |        1 | A    |
|      4 |        2 | C    |
|      5 |        2 | C    |
|      6 |        1 | A    |
+--------+----------+------+
6 rows in set (0.00 sec)

Edit: The other Questions result in (padded_path included):

+--------+----------+------+--------------------------+
| postid | parentid | type | padded_path              |
+--------+----------+------+--------------------------+
|      3 |        0 | Q    | aaaaaaa0aaaaaaa3         |
|      7 |        3 | C    | aaaaaaa0aaaaaaa3bbbbbbb7 |
+--------+----------+------+--------------------------+
2 rows in set (0.00 sec)


+--------+----------+------+----------------------------------+
| postid | parentid | type | padded_path                      |
+--------+----------+------+----------------------------------+
|      9 |        0 | Q    | aaaaaaa0aaaaaaa9                 |
|     10 |        9 | C    | aaaaaaa0aaaaaaa9bbbbbb10         |
|     11 |        9 | A    | aaaaaaa0aaaaaaa9cccccc11         |
|     12 |       11 | C    | aaaaaaa0aaaaaaa9cccccc11bbbbbb12 |
+--------+----------+------+----------------------------------+
4 rows in set (0.00 sec)

Changing the predicate q1.postid=1 to q1.type='Q' results in:

+--------+----------+------+----------------------------------+
| postid | parentid | type | padded_path                      |
+--------+----------+------+----------------------------------+
|      1 |        0 | Q    | aaaaaaa0aaaaaaa1                 |
|      8 |        1 | C    | aaaaaaa0aaaaaaa1bbbbbbb8         |
|      2 |        1 | A    | aaaaaaa0aaaaaaa1ccccccc2         |
|      4 |        2 | C    | aaaaaaa0aaaaaaa1ccccccc2bbbbbbb4 |
|      5 |        2 | C    | aaaaaaa0aaaaaaa1ccccccc2bbbbbbb5 |
|      6 |        1 | A    | aaaaaaa0aaaaaaa1ccccccc6         |
|      3 |        0 | Q    | aaaaaaa0aaaaaaa3                 |
|      7 |        3 | C    | aaaaaaa0aaaaaaa3bbbbbbb7         |
|      9 |        0 | Q    | aaaaaaa0aaaaaaa9                 |
|     10 |        9 | C    | aaaaaaa0aaaaaaa9bbbbbb10         |
|     11 |        9 | A    | aaaaaaa0aaaaaaa9cccccc11         |
|     12 |       11 | C    | aaaaaaa0aaaaaaa9cccccc11bbbbbb12 |
+--------+----------+------+----------------------------------+
12 rows in set (0.00 sec)

Related Solutions

Mysql – Index optimization

I think that the best for your tag and posttag tables is:

ok for the tagid as a primary key
add an index on your tag field (this cover your second query)
add two index on posttag table like INDEX(postid,tagid) and INDEX(tagid,postid) (this cover your first and second query)

So:

CREATE TABLE `posts` (
  `postid` mediumint(9) NOT NULL AUTO_INCREMENT,
  `post` varchar(255) NOT NULL,
  PRIMARY KEY (`postid`)
);

CREATE TABLE `posttags` (
  `postid` mediumint(9) NOT NULL,
  `tagid` mediumint(9) NOT NULL,
  PRIMARY KEY (`postid`,`tagid`),
  UNIQUE KEY `tagid_postid_UQ` (`tagid`,`postid`),
  FOREIGN KEY (postid)
    REFERENCES posts (postid),
  FOREIGN KEY (tagid)
    REFERENCES tags (tagid)
);

CREATE TABLE `tags` (
  `tagid` mediumint(9) NOT NULL AUTO_INCREMENT,
  `tag` char(30) NOT NULL,
  PRIMARY KEY (`tagid`),
  UNIQUE KEY `tag_UQ` (`tag`)
);

And for your query:

mysql> explain select p.* from tags as t    inner join posttags as pt on t.tagid = pt.tagid   inner join posts as p on p.postid = pt.postid  where t.tag =  ?;
+----+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref            | rows | Extra                    |
+----+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+
|  1 | SIMPLE      | t     | ref    | PRIMARY,tag   | tag     | 32      | const          |    1 | Using where; Using index |
|  1 | SIMPLE      | pt    | ref    | postid,tagid  | tagid   | 3       | test.t.tagid   |    1 | Using index              |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY       | PRIMARY | 4       | test.pt.postid |    1 |                          |
+----+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+


explain select p.* from posts as p    inner join posttags as pt on pt.postid = p.postid   inner join tags as t on t.tagid = pt.tagid  where p.postid = ?;
+----+-------------+-------+--------+---------------+---------+---------+---------------+------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref           | rows | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+---------------+------+-------------+
|  1 | SIMPLE      | p     | const  | PRIMARY       | PRIMARY | 4       | const         |    1 |             |
|  1 | SIMPLE      | pt    | ref    | postid,tagid  | postid  | 4       | const         |    2 | Using index |
|  1 | SIMPLE      | t     | eq_ref | PRIMARY       | PRIMARY | 3       | test.pt.tagid |    1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+---------------+------+-------------+

Database Performance – Are Individual Queries Faster Than Joins?

Are individual queries faster than joins, or: Should I try to squeeze every info I want on the client side into one SELECT statement or just use as many as seems convenient?

In any performance scenario, you have to test and measure the solutions to see which is faster.

That said, it's almost always the case that a joined result set from a properly tuned database will be faster and scale better than returning the source rows to the client and then joining them there. In particular, if the input sets are large and the result set is small -- think about the following query in the context of both strategies: join together two tables that are 5 GB each, with a result set of 100 rows. That's an extreme, but you see my point.

I have noticed that when I have to get information from multiple tables, it is "often" faster to get this information via multiple queries on individual tables (maybe containing a simple inner join) and patch the data together on the client side that to try to write a (complex) joined query where I can get all the data in one query.

It's highly likely that the database schema or indexes could be improved to better serve the queries you're throwing at it.

A joined query always has to return more data than the individual queries that receive the same amount of information.

Usually this is not the case. Most of the time even if the input sets are large, the result set will be much smaller than the sum of the inputs.

Depending on the application, very large query result sets being returned to the client are an immediate red flag: what is the client doing with such a large set of data that can't be done closer to the database? Displaying 1,000,000 rows to a user is highly suspect to say the least. Network bandwidth is also a finite resource.

Since the database has to cobble together the data, for large datasets one can assume that the database has to do more work on a single joined query than on the individual ones, since (at least) it has to return more data to the client.

Not necessarily. If the data is indexed correctly, the join operation is more likely to be done more efficiently at the database without needing to scan a large quantity of data. Moreover, relational database engines are specially optimized at a low level for joining; client stacks are not.

Would it follow from this, that when I observe that splitting a client side query into multiple queries yield better performance, this is just the way to go, or would it rather mean that I messed up the joined query?

Since you said you're inexperienced when it comes to databases, I would suggest learning more about database design and performance tuning. I'm pretty sure that's where the problem lies here. Inefficiently-written SQL queries are possible, too, but with a simple schema that's less likely to be a problem.

Now, that's not to say there aren't other ways to improve performance. There are scenarios where you might choose to scan a medium-to-large set of data and return it to the client if the intention is to use some sort of caching mechanism. Caching can be great, but it introduces complexity in your design. Caching may not even be appropriate for your application.

One thing that hasn't been mentioned anywhere is maintaining consistency in the data that's returned from the database. If separate queries are used, it's more likely (due to many factors) to have inconsistent data returned, unless a form of snapshot isolation is used for every set of queries.

Best Answer

Related Solutions

Mysql – Index optimization

Database Performance – Are Individual Queries Faster Than Joins?

Related Question