I've inherited a medium-sized database with a terrible schema. The sanitized portion in question is like so:
CREATE TABLE `pending` (
...
`invoice` int(11) DEFAULT NULL,
`lid` int(11) DEFAULT NULL,
`custacct` varchar(21) DEFAULT NULL,
UNIQUE KEY `pend_inv` (`invoice`),
KEY `pend_acct` (`custacct`),
KEY `pend_pid` (`pid`)
...
) ENGINE=InnoDB
CREATE TABLE `done` (
...
`invoice` int(11) DEFAULT NULL,
`lid` int(11) DEFAULT NULL,
`custacct` varchar(21) DEFAULT NULL,
UNIQUE KEY `done_inv` (`invoice`),
KEY `done_acct` (`custacct`),
KEY `done_pid` (`pid`)
...
) ENGINE=InnoDB
And two similar tables, customers and locations:
CREATE TABLE `customers` (
`acct` varchar(14) NOT NULL,
...
PRIMARY KEY (`acct`),
...
) ENGINE=InnoDB
So, yes, when an invoice is not yet fulfilled, it's in the "pending" table, then when the company's finished with it, it gets moved to the "done" table. So any given invoice will only be on one of the two tables.
Unfortunately, we're still actively using the proprietary software that runs on this schema, so I am powerless to fix it. The "pending" table will usually have around 9000 rows, whereas the "done" table is ~800,000 and counting.
For two years (the full extent of my LAMP/MySQL experience), I have been writing UNIONs to capture information from the customers or locations tables, with the first query joining customers/location/both against "pending" and the second joining customers/locations/both against "done."
I've recently had a nagging feeling that writing two nearly-identical queries UNIONed together is probably newbish and should be done differently. So I finally applied my more recent learning to a problem I had "solved" in my first few weeks of working with MySQL and realized that I can just use a derived table with a UNION inside that subquery to achieve the same results with less typing.
The execution time is near-similar, usually 0.01s for the derived query and slightly less than that for the UNIONed query, but the dervied table query uses far fewer characters and requires less time to type out.
So my question for those with more experience, is using a derived table subquery more sane and maintainable in the long run? It's easier for me, the programmer, so that makes me suspicious that maybe it's not the best practice!
Sample queries below:
select c.whatever,l.whatever from customers c join
(select d.custacct as acct,d.lid from done d where d.invoice=123456
union
select p.custacct,p.lid from pending p where p.invoice=123456
) as combi
on c.acct=combi.acct join locations l on combi.lid=l.lid;
select c.whatever,l.whatever from done d join customers c on d.custacct=c.acct join locations l on d.lid=l.lid where d.invoice=123456
union select c2.whatever,l2.whatever from pending p join customers c2 on p.custacct=c2.acct join locations l2 on p.lid=l2.lid where p.invoice=123456;
Best Answer
In your example query, neither way is particularly better, because the inner queries have their own where clauses, are generating a very small result-set, and are able to do this very efficiently presumably because the
invoice
column inpending
anddone
are indexed.If you were doing something more complex, and especially without such highly-selective where clauses in the inter queries, the union would likely perform better... because in the derived table scenario, the derived table is materialized before the joins to the outer tables and indexes on the tables used in the inner query can't be used to handle the join... on the other hand, with
UNION
, the optimizer could use indexes on pending/done if they were appropriate.Personally, I'd be inclined to solve this one by writing one or more stored procedures to encapsulate the union queries and return the answer as a result set using an unbounded
SELECT
.These are called "unbounded." So, a stored procedure that takes the invoice id as an input parameter, does the
SELECT
...UNION ALL
...SELECT
can encapsulate everything and give you a very clean way to ask the database for what you need and get a record-set in return.If you subsequently decide union or derived really was The One True Way after all, then you only have to update that in one place, in the procedure definition.
Free tip:
UNION
meansUNION DISTINCT
in non-dinosaur versions of MySQL, while some older versions will interpret it asUNION ALL
. It's probably good practice to explicitly use the one you actually want, so you always get the behavior you expect. It's also generally good from a performance standpoint to useUNION ALL
unless you needUNION DISTINCT
because it's one less thing for the optimizer to have to deal with, but this is much more the case with large result sets, where a lot of time could be spent de-duplicating a set of data that is already free of duplicates.