Mysql – Using GROUP BY WITH ROLLUP on column that contains NULL

group byMySQLnull

I have the following table (see it on SQL Fiddle) (I created to break down my problem):

| ID | word    |
----------------
| 5  | "Hello" |
| 6  |  NULL   |
| 7  | "World" |
| 8  | "World" |

Now I want to count the number of occurrences of each word using GROUP BY word WITH ROLLUP. The NULL in the column word of the row generated by the ROLLUP should be replaced by "total":

SELECT
  ID,
  ifnull(word, "total") as word,
  count(*) as occurrences
FROM test
GROUP BY word WITH ROLLUP;

The problem is that it also replaces the NULL in the record with the amount of rows where words is NULL:

| ID |  word | occurrences |
|----|-------|-------------|
|  6 | total |           1 | <- Here lies the problem
|  5 | Hello |           1 |
|  7 | world |           2 |
|  7 | total |           4 |

So I now use count(word) to distinguish whether it's a NULL from my data or it's a NULL created by ROLLUP so I can replace it with either "empty" or "total":

SELECT
  ID,
  if(count(word) = 0, "empty", ifnull(word, "total")) as word,
  count(*) as occurrences
FROM test
group by word with ROLLUP;

This works for my use case but it still has a flaw:
If word is NULL in all rows, count(word) is also 0 in the final row showing the total: (SQL Fiddle with different data (only record#6))

| ID |  word | occurrences |  
|----|-------|-------------|  
|  6 | empty |           1 |  
|  6 | empty |           1 |  <- word should be "total"

The following statement delivers the result I want:

SELECT
  ID,
  ifnull(word, "total") as word,
  count(*) as occurrences
FROM
(
SELECT
  ID,
  ifnull(word, "empty") as word
FROM test
) tmp
GROUP BY word WITH ROLLUP

But since in most answers around here the use of subqueries seems to be discouraged, I am wondering if there is a better solution.

Best Answer

you have a "solution", so by saying

I am wondering if there is a better solution.

I think you want to comment about it. Let me focus on several aspects:

  1. SELECT ID [...] GROUP BY word, independently of having or not a WITH ROLLUP is wrong. You are selecting a field whose value is undetermined. In particular, with MySQL you get exposed of returning a random value; or if you use an strict mode (recommended, and by default in the latest versions of MySQL), of the query to fail: 'db_9_b0ff7.test.ID' isn't in GROUP BY

  2. Be it formally or not, WITH ROLLUP Note- (I refer to MySQL implementation) is something that is nice to have in some cases because it avoids a double rescan of the table, but I have not seen a lot in production code. Nothing wrong with it, but they were an exclusive MSSQL feature until it was implemented on SQL1999, and either it was not available widely back in the day or, like in the case of MySQL it was a limited implementation. Reference: https://www.percona.com/blog/2007/09/17/using-group-by-with-rollup-for-reporting-performance-optimization/

  3. I do not see in principle a problem with your final query, if we run explain on both queries (without ID), we see:

    MariaDB [test]> EXPLAIN SELECT   if(count(word) = 0, "empty", ifnull(word, "total")) as word,   count(*) as occurrences FROM test group by word with ROLLUP\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: test
             type: ALL
    possible_keys: NULL
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 4
            Extra: Using filesort
    1 row in set, 1 warning (0.00 sec)
    
    Warning (Code 1052): Column 'word' in group statement is ambiguous
    MariaDB [test]> EXPLAIN SELECT   ifnull(word, "total") as word,   count(*) as occurrences FROM ( SELECT   ifnull(word, "empty") as word FROM test ) tmp GROUP BY word WITH ROLLUP\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: test
             type: ALL
    possible_keys: NULL
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 4
            Extra: Using filesort
    1 row in set, 1 warning (0.01 sec)
    

    Normally the hate towards MySQL is because its limitations on early versions of MySQL. I am not going to say that it is now perfect, there will be still issues, as MySQL development, as it has always been, focused on a fast, easy-in, easy-out RDBMS, and not a full-blown relational engine, but a subquery in the right place is ok (after all, some queries will require a subquery, no matter what. Your usage is ok- it will only have an overhead of going over the summarized rows again and I suppose on the real deal, those will not be significant compared to the full set.

    Note- on my instance up here, the subquery is not showed, but you can see it with materialized subqueries in 5.6 here: http://sqlfiddle.com/#!9/b0ff7/7 (It will execute the inner query and then the outer part, with no issue). An index could definitely help here (due to the filesort), but that depends on the final number of records. The subquery may hurt performance if an index (which does not exist now) cannot be used because it. You have to test your specific mysql version and create such an index.

  4. I think your problem is with the data model- in the SQL world (outside of Oracle) NULL is an unknown/invalid value. Seem to be correlating that to 'empty' which is not right- if a word is known to not exist, it should be '' (the empty string, not NULL). I do not know if you can change that, but I do not like your current model. I understand that you may not have something to say about it, so I will not focus on that. Alos you count NULL words? More reasons to use '' instead of NULL.
  5. Now the final question is, is it possible to do it without a subquery on MySQL, and will it be better/less ugly? I do not really see a way- we could do a join, but it will still be to a subquery. I can think of an alternative, which is:

    SELECT * FROM (
        SELECT   word,   
        count(*) as occurrences 
        FROM test 
        GROUP BY word with rollup) 
    tmp 
    ORDER BY occurrences;
    

    That makes sure that the last row contains the summary (and if there is other null, it is the actual null value). The order would happen even on the case of only null values. On the contrary, it may even slower as it requires ordering, and again you may need an index, and that index may or may not be used.