Mysql – Filtering subquery results with group by

group byjoin;MySQLsubquery

I have a join that works but seems hacky because it orders the subquery and uses group by to filter everything but the top subquery result. Without doing so, the cartesian product of s and print_issues.pub_date would be returned.

SELECT 
    print_issues.pub_date, 
    print_issues.x,
    s.page_size 
FROM print_issues, 
  ( select pub_date, page_size 
    from print_page_size, 
          print_issues 
    where effective_date < print_issues.pub_date 
    order by effective_date desc
  ) as s 
WHERE s.pub_date = print_issues.pub_date 
group by print_issues.pub_date 
order by print_issues.pub_date desc

The query is an artifact of the schema of print_page_size. That table has the columns page_size and effective-date. New entries with a greater effective_date supersede old ones. The goal of the query is to unite the print_issue table with the page_size in effect when each issue was published.

Sample Data:

TABLE: print_page_size
======================
effective_date      page_size
2014-01-01          100
2017-05-01          105

TABLE: print_issues
===================
pub_date      x
2017-04-26    "Random"
2017-05-02    "Data"

OUTPUT
======
print_issues.pub_date          print_issues.x    s.page_size
2017-04-26                     "Random"          100
2017-05-02                     "Data"            105      

My questions: Is there a problem with this approach? Is there a more direct way to to achieve the desired result?

Best Answer

Your query is non-deterministic as it abuses GROUP BY.

It may give incorrect/unexpected results in any version of MySQL, depending on the execution plan and available indexes. You may not have seen such behaviour because it hasn't happened yet (you were just lucky or the indexes and size of the table have up to now lead to plans that give correct results) or it happened but you didn't notice.

There are versions (MariaDB 5.3+) where similar queries - not sure about this one but similar - will almost always give incorrect results due to some optimizations that remove the redundant ORDER BY in the subquery.

There are versions of MySQL (5.7) where you'll get an error if you try the query - exactly because it might give inconsistent results. In that version, you'll have to change the default settings (and get the wrong non-deterministic behaviour) or you'll have to rewrite/correct the query.

In short, don't use it.


I would write the query like this. Much simpler, deterministic and probably more efficient as there is no GROUP BY and doesn't join the pi table twice:

SELECT 
         pi.pub_date, 
         pi.x,
         ( SELECT   pps.page_size
           FROM     print_page_size AS pps
           WHERE    pps.effective_date < pi.pub_date
           ORDER BY pps.effective_date DESC
           LIMIT    1
         ) AS page_size
FROM  
         print_issues AS pi
ORDER BY 
         pub_date ;              -- or whatever

An index on print_page_size (effective_date, pagesize) would be further help.