MySQL Best Practices – Using Unions or Derived Tables

join;MySQLsubqueryunion

I've inherited a medium-sized database with a terrible schema. The sanitized portion in question is like so:

CREATE TABLE `pending` (
  ...
  `invoice` int(11) DEFAULT NULL,
  `lid` int(11) DEFAULT NULL,
  `custacct` varchar(21) DEFAULT NULL,
  UNIQUE KEY `pend_inv` (`invoice`),
  KEY `pend_acct` (`custacct`),
  KEY `pend_pid` (`pid`)
  ...
) ENGINE=InnoDB

CREATE TABLE `done` (
  ...
  `invoice` int(11) DEFAULT NULL,
  `lid` int(11) DEFAULT NULL,
  `custacct` varchar(21) DEFAULT NULL,
  UNIQUE KEY `done_inv` (`invoice`),
  KEY `done_acct` (`custacct`),
  KEY `done_pid` (`pid`)
  ...
) ENGINE=InnoDB

And two similar tables, customers and locations:

CREATE TABLE `customers` (
  `acct` varchar(14) NOT NULL,
  ...
  PRIMARY KEY (`acct`),
  ...
) ENGINE=InnoDB

So, yes, when an invoice is not yet fulfilled, it's in the "pending" table, then when the company's finished with it, it gets moved to the "done" table. So any given invoice will only be on one of the two tables.

Unfortunately, we're still actively using the proprietary software that runs on this schema, so I am powerless to fix it. The "pending" table will usually have around 9000 rows, whereas the "done" table is ~800,000 and counting.

For two years (the full extent of my LAMP/MySQL experience), I have been writing UNIONs to capture information from the customers or locations tables, with the first query joining customers/location/both against "pending" and the second joining customers/locations/both against "done."

I've recently had a nagging feeling that writing two nearly-identical queries UNIONed together is probably newbish and should be done differently. So I finally applied my more recent learning to a problem I had "solved" in my first few weeks of working with MySQL and realized that I can just use a derived table with a UNION inside that subquery to achieve the same results with less typing.

The execution time is near-similar, usually 0.01s for the derived query and slightly less than that for the UNIONed query, but the dervied table query uses far fewer characters and requires less time to type out.

So my question for those with more experience, is using a derived table subquery more sane and maintainable in the long run? It's easier for me, the programmer, so that makes me suspicious that maybe it's not the best practice!

Sample queries below:

select c.whatever,l.whatever from customers c join
  (select d.custacct as acct,d.lid from done d where d.invoice=123456
     union
   select p.custacct,p.lid from pending p where p.invoice=123456
  ) as combi
    on c.acct=combi.acct join locations l on combi.lid=l.lid;


select c.whatever,l.whatever from done d join customers c on d.custacct=c.acct join locations l on d.lid=l.lid where d.invoice=123456
union select c2.whatever,l2.whatever from pending p join customers c2 on p.custacct=c2.acct join locations l2 on p.lid=l2.lid where p.invoice=123456;

Best Answer

In your example query, neither way is particularly better, because the inner queries have their own where clauses, are generating a very small result-set, and are able to do this very efficiently presumably because the invoice column in pending and done are indexed.

If you were doing something more complex, and especially without such highly-selective where clauses in the inter queries, the union would likely perform better... because in the derived table scenario, the derived table is materialized before the joins to the outer tables and indexes on the tables used in the inner query can't be used to handle the join... on the other hand, with UNION, the optimizer could use indexes on pending/done if they were appropriate.

Personally, I'd be inclined to solve this one by writing one or more stored procedures to encapsulate the union queries and return the answer as a result set using an unbounded SELECT.

MySQL supports a very useful extension that enables the use of regular SELECT statements (that is, without using cursors or local variables) inside a stored procedure. The result set of such a query is simply sent directly to the client.

http://dev.mysql.com/doc/refman/5.5/en/stored-routines-syntax.html

These are called "unbounded." So, a stored procedure that takes the invoice id as an input parameter, does the SELECT ... UNION ALL ... SELECT can encapsulate everything and give you a very clean way to ask the database for what you need and get a record-set in return.

CALL get_invoice_customer_location_detail_by_invoice_id(123456); # long, but much more appealing

If you subsequently decide union or derived really was The One True Way after all, then you only have to update that in one place, in the procedure definition.

Free tip: UNION means UNION DISTINCT in non-dinosaur versions of MySQL, while some older versions will interpret it as UNION ALL. It's probably good practice to explicitly use the one you actually want, so you always get the behavior you expect. It's also generally good from a performance standpoint to use UNION ALL unless you need UNION DISTINCT because it's one less thing for the optimizer to have to deal with, but this is much more the case with large result sets, where a lot of time could be spent de-duplicating a set of data that is already free of duplicates.