Db2 – Filtering on a virtual column from a UNION ALL with LEFT OUTER join causes a table scan

db2

I asked this question on SO but havn't got many bites and even a vote to close for being "Off Topic." So, I thought this stack exchange might be better suited.

Consider the following SQL…

SELECT allStuff.id
  FROM    (SELECT id, 1 AS clazz_
             FROM MyStuff
           UNION ALL
           SELECT id, 2 AS clazz_
             FROM MyWarehousedStuff) allStuff
       LEFT OUTER JOIN
          Chilren child
       ON allStuff.id = child.id
 WHERE allStuff.id IN (/*fast sub select */)
       AND allStuff.clazz_= ? /* if I remove this line the query is fast (index fast)
                                 but if it stays in then DB2 does a tablescan
                                 on MyStuff and MyWarehousedStuff (very expensive) */

To make things a bit more confusing, if I remove the LEFT OUTER JOIN or make it an INNER JOIN DB2 does the right thing. I find it highly unusuall that an AND would cause worse perforamnce. I am also surprised by the use of a table scan when the field in question isn't actually on the table and is constant.

Note, this is a cleaned up Hibernate query so I'm not to worried about fixing the SQL in question but rather I want to know why DB2 is acting a particular way.

What would be causing this behavior?

[update] – It appears that even adding allStuff.clazz_ to the column list is enough to trigger the table scanning.

[update] – Also seems that the IN subslect is part of the problem. Switching that to a literal subselect IN (?, ?) is fast but IN (select 1 from SYSIBM.SYSDUMMY1) is not

[update] – Using DB2 v9.1 z/OS

[update] – To be clear, the following is also slow

SELECT allStuff.id, allStuff.clazz_
  FROM    (SELECT id, 1 AS clazz_
             FROM MyStuff
           UNION ALL
           SELECT id, 2 AS clazz_
             FROM MyWarehousedStuff) allStuff
       LEFT OUTER JOIN
          Chilren child
       ON allStuff.id = child.id
 WHERE allStuff.id IN (select 1 from sysibm.sysdummy1)

But this is fast…

SELECT allStuff.id, allStuff.clazz_
  FROM    (SELECT id, 1 AS clazz_
             FROM MyStuff
           UNION ALL
           SELECT id, 2 AS clazz_
             FROM MyWarehousedStuff) allStuff
       LEFT OUTER JOIN
          Chilren child
       ON allStuff.id = child.id
 WHERE allStuff.id IN (?, ?)

Fast…

SELECT allStuff.id, allStuff.clazz_
  FROM    (SELECT id, 1 AS clazz_
             FROM MyStuff
           UNION ALL
           SELECT id, 2 AS clazz_
             FROM MyWarehousedStuff) allStuff
       INNER JOIN
          Chilren child
       ON allStuff.id = child.id
 WHERE allStuff.id IN (?, ?)

Fast…

SELECT allStuff.id, allStuff.clazz_
  FROM    (SELECT id, 1 AS clazz_
             FROM MyStuff
           UNION ALL
           SELECT id, 2 AS clazz_
             FROM MyWarehousedStuff) allStuff
 WHERE allStuff.id IN (select 1 from sysibm.sysdummy1)

Fast…

SELECT allStuff.id
  FROM    (SELECT id, 1 AS clazz_
             FROM MyStuff
           UNION ALL
           SELECT id, 2 AS clazz_
             FROM MyWarehousedStuff) allStuff
       LEFT OUTER JOIN
          Chilren child
       ON allStuff.id = child.id
 WHERE allStuff.id IN (select 1 from sysibm.sysdummy1)

Best Answer

Just off the top of my head, it makes sense that it needs to tablescan those two tables. They are both present in a UNION ALL. This essentially means all records are in consideration. You then also use a LEFT OUTER JOIN meaning everything from the left side (your two union'd) tables even if it doesn't match the right side (your Children table). This is telling DB2 that no matter what every possible record in the two tables of your UNION ALL is under consideration.

Now...to fix that problem. I would ask, do you have any indexes over the clazz column in both tables? Since clazz is a predicate of your where clause you may wish to have a index over that field in both tables (MyStuff and MyWarehouseStuff). You don't need an index over the id fields (as long as they are primary keys), because you gain that by default in DB2.

To better verify what DB2 is doing, you should use the visual explain (of course you may have done that). It should show you why it is picking what it is. A few other things to keep in mind when examining the explain, do you run stats? If you don't, you should. Running stats will keep DB2 better informed as to statistics about its tables, data, indexes, etc, and may cause it to choose a better access path. Also, this may seem a silly question, what version and fix pack of DB2 are you on? Occasionally certain versions/fix packs end up buggy and there are performance issues due to poor access path selection by the optimizer. An example of this, if you are on DB2 9.7 then you want to be on fix pack 4 or 5. There is at least one fix pack below those that had serious performance issues due to a bug in the optimizer.

Anyway, check out your indexes first. Adding one to both the tables I mentioned may solve your problem. Worse case, I suppose you could re-write your query to put the clazz predicate up into the sub select "table". That may cause the optimizer to eliminate rows before it unions and joins. Just another thought.

EDIT: Here is a thought on your query

SELECT allStuff.id
  FROM    (SELECT id, 1 AS clazz_
             FROM MyStuff
           UNION ALL
           SELECT id, 2 AS clazz_
             FROM MyWarehousedStuff) allStuff
       LEFT OUTER JOIN
          Chilren child
       ON allStuff.id = child.id
 WHERE allStuff.id IN (/*fast sub select */)
       AND allStuff.clazz_= ? 

You are using a constant basically that from what I understand doesn't exist in the table. You also are not selecting this, but you have applied it to every row, and you use it in the WHERE clause. So I ask why do this? It really is just saying select every row. You may as well be doing select *. This will contribute to the table scan. Plus you are not displaying this value back. So why use it? Why not rather do the following:

SELECT allStuff.id
  FROM    (SELECT id FROM MyStuff
           UNION ALL
           SELECT id, FROM MyWarehousedStuff) allStuff
       LEFT OUTER JOIN
          Chilren child
       ON allStuff.id = child.id
 WHERE allStuff.id IN (/*fast sub select */)

Since you are not selecting the data, unless you oops'd on your query you are posting, isn't this what you would rather have? It union (all)'s the ids, joins them to the child table and then returns what you want?

Although I would also question the point of your LEFT OUTER JOIN. Because if your left table (in this case allStuff) has a NULL for the id it shouldn't match your "fast sub select", unless that allows NULL's as well. (Does it?), so you would be excluding these. I don't know the context of what you are trying to accomplish so perhaps I am not speaking to it appropriately, but it makes me question having a LEFT OUTER JOIN as well.