Mysql – Query to compare two subsets of data from the same table

join;MySQLview

My problem:

I'm trying to compare to subsets of data within one table, and I've got two methods that partially work and one certainty that there's got to be a more correct way to do it.

The idea here is a table which contains datasets about the same systems over time, and I would like to compare them and especially to see when there are introductions or absences. Allow me to demonstrate with a simple test table:

mysql> select * from gocore;
+-----+------------+------+----------+----------+
| uid | server     | tag  | software | revision |
+-----+------------+------+----------+----------+
|   1 | enterprise | old  | apache   | 2.2.25   |
|   2 | enterprise | new  | apache   | 2.4.6    |
|   3 | enterprise | new  | tomcat   | 7.0.42   |
|   4 | enterprise | old  | geronimo | 2.1.7    |
+-----+------------+------+----------+----------+

In this example there are two datasets – the tag "old" dataset, and the tag "new" dataset. Each reflects a data sample taken at some point in time. For server "enterprise", we have one software package that changed over time (apache), one software package that was introduced (tomcat), and one software package that became absent (geronimo).

My goal: A query that will allow me to summarize the state between "old" and "new":

+------------+----------+----------+----------+
| server     | software | revision | revision |
+------------+----------+----------+----------+
| enterprise | apache   | 2.2.25   | 2.4.6    |
| enterprise | geronimo | 2.1.7    | NULL     |
| enterprise | tomcat   | NULL     | 7.0.42   |
+------------+----------+----------+----------+

It is important for my purposes to be able to see the 'NULL' cells above – I need to know when software has been added or removed from the system. TO BE CLEAR, the table above is not the result of a query – it's me using a text editor to fix up what I was getting to describe what I'm looking for. I need your help to figure out the query that would make that table 🙂

My Kludges:

If I perform a LEFT JOIN and use the WHERE clause to discriminate between "old" and "new" tags, I get results for only those entries that exist under both tags:

mysql> select old.server, old.software, old.revision, new.software, new.revision
    -> from gocore as old left join gocore as new on old.software = new.software
    -> where old.tag = 'old' and new.tag = 'new';
+------------+----------+----------+----------+----------+
| server     | software | revision | software | revision |
+------------+----------+----------+----------+----------+
| enterprise | apache   | 2.2.25   | apache   | 2.4.6    |
+------------+----------+----------+----------+----------+

My next try was to create two views so that I could perform the JOIN without throwing the tag filter into the mix:

mysql> create view gc_old as select uid,server,tag,software,revision
    -> from gocore where tag = 'old';
Query OK, 0 rows affected (0.00 sec)

mysql> create view gc_new as select uid,server,tag,software,revision
    -> from gocore where tag = 'new';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from gc_old;
+-----+------------+------+----------+----------+
| uid | server     | tag  | software | revision |
+-----+------------+------+----------+----------+
|   1 | enterprise | old  | apache   | 2.2.25   |
|   4 | enterprise | old  | geronimo | 2.1.7    |
+-----+------------+------+----------+----------+
2 rows in set (0.00 sec)

mysql> select * from gc_new;
+-----+------------+------+----------+----------+
| uid | server     | tag  | software | revision |
+-----+------------+------+----------+----------+
|   2 | enterprise | new  | apache   | 2.4.6    |
|   3 | enterprise | new  | tomcat   | 7.0.42   |
+-----+------------+------+----------+----------+
2 rows in set (0.00 sec)

mysql> select old.server, old.software, old.revision, new.revision
    -> from gc_old as old left join gc_new as new
    -> on old.software = new.software;
+------------+----------+----------+----------+
| server     | software | revision | revision |
+------------+----------+----------+----------+
| enterprise | apache   | 2.2.25   | 2.4.6    |
| enterprise | geronimo | 2.1.7    | NULL     |
+------------+----------+----------+----------+
2 rows in set (0.00 sec)

That works better – I now see absence, not just change, but I still don't see introduction. And I had to go create views for this, which strikes me as inflexible as the dataset is added to over time.

My questions:

  1. How can I make sure that all the holes show up, as represented by the NULL cells in the "My goal" section above?
  2. Can I do it within a single query, avoiding the creation of views as a crutch?

Any help you can provide is greatly appreciated. Thanks!

Best Answer

I think you have to hack it a bit with a derived table, AKA an implicit temporary table, AKA a "subquery in the from clause."

We derive a table we'll call `t` containing each distinct (server,software) from gocore, then left join to gocore twice, once on tag = 'old' and once on tag = 'new'.

SELECT t.server, t.software, o.revision AS old_rev, n.revision AS new_rev
  FROM (SELECT DISTINCT server, software FROM gocore) t
  LEFT JOIN gocore o ON o.server = t.server AND o.software = t.software AND o.tag = 'old'
  LEFT JOIN gocore n ON n.server = t.server AND n.software = t.software AND n.tag = 'new';