I have the following table (see it on SQL Fiddle) (I created to break down my problem):
| ID | word |
----------------
| 5 | "Hello" |
| 6 | NULL |
| 7 | "World" |
| 8 | "World" |
Now I want to count the number of occurrences of each word using GROUP BY word WITH ROLLUP
. The NULL in the column word of the row generated by the ROLLUP should be replaced by "total":
SELECT
ID,
ifnull(word, "total") as word,
count(*) as occurrences
FROM test
GROUP BY word WITH ROLLUP;
The problem is that it also replaces the NULL
in the record with the amount of rows where words is NULL:
| ID | word | occurrences |
|----|-------|-------------|
| 6 | total | 1 | <- Here lies the problem
| 5 | Hello | 1 |
| 7 | world | 2 |
| 7 | total | 4 |
So I now use count(word)
to distinguish whether it's a NULL from my data or it's a NULL created by ROLLUP so I can replace it with either "empty" or "total":
SELECT
ID,
if(count(word) = 0, "empty", ifnull(word, "total")) as word,
count(*) as occurrences
FROM test
group by word with ROLLUP;
This works for my use case but it still has a flaw:
If word is NULL in all rows, count(word)
is also 0 in the final row showing the total: (SQL Fiddle with different data (only record#6))
| ID | word | occurrences |
|----|-------|-------------|
| 6 | empty | 1 |
| 6 | empty | 1 | <- word should be "total"
The following statement delivers the result I want:
SELECT
ID,
ifnull(word, "total") as word,
count(*) as occurrences
FROM
(
SELECT
ID,
ifnull(word, "empty") as word
FROM test
) tmp
GROUP BY word WITH ROLLUP
But since in most answers around here the use of subqueries seems to be discouraged, I am wondering if there is a better solution.
Best Answer
you have a "solution", so by saying
I think you want to comment about it. Let me focus on several aspects:
SELECT ID [...] GROUP BY word
, independently of having or not aWITH ROLLUP
is wrong. You are selecting a field whose value is undetermined. In particular, with MySQL you get exposed of returning a random value; or if you use an strict mode (recommended, and by default in the latest versions of MySQL), of the query to fail: 'db_9_b0ff7.test.ID' isn't in GROUP BYBe it formally or not,
WITH ROLLUP
Note- (I refer to MySQL implementation) is something that is nice to have in some cases because it avoids a double rescan of the table, but I have not seen a lot in production code. Nothing wrong with it, but they were an exclusive MSSQL feature until it was implemented on SQL1999, and either it was not available widely back in the day or, like in the case of MySQL it was a limited implementation. Reference: https://www.percona.com/blog/2007/09/17/using-group-by-with-rollup-for-reporting-performance-optimization/I do not see in principle a problem with your final query, if we run explain on both queries (without ID), we see:
Normally the hate towards MySQL is because its limitations on early versions of MySQL. I am not going to say that it is now perfect, there will be still issues, as MySQL development, as it has always been, focused on a fast, easy-in, easy-out RDBMS, and not a full-blown relational engine, but a subquery in the right place is ok (after all, some queries will require a subquery, no matter what. Your usage is ok- it will only have an overhead of going over the summarized rows again and I suppose on the real deal, those will not be significant compared to the full set.
Note- on my instance up here, the subquery is not showed, but you can see it with materialized subqueries in 5.6 here: http://sqlfiddle.com/#!9/b0ff7/7 (It will execute the inner query and then the outer part, with no issue). An index could definitely help here (due to the filesort), but that depends on the final number of records. The subquery may hurt performance if an index (which does not exist now) cannot be used because it. You have to test your specific mysql version and create such an index.
Now the final question is, is it possible to do it without a subquery on MySQL, and will it be better/less ugly? I do not really see a way- we could do a join, but it will still be to a subquery. I can think of an alternative, which is:
That makes sure that the last row contains the summary (and if there is other null, it is the actual null value). The order would happen even on the case of only null values. On the contrary, it may even slower as it requires ordering, and again you may need an index, and that index may or may not be used.