Mariadb – Using multiple GROUP BYs to get fewer records, not more

greatest-n-per-groupgroup bymariadb

I feel like this is a simple thing to do, but I've been playing around with trying to achieve the desired result and the only way I'm coming up is with 2 nested subqueries, which doesn't seem great from a performance perspective.

I have a table called billing which is currently 60,000 rows and growing at an increasing rate, with maybe 100 or 200 new entries per day.

Here is a query I wrote that returns exactly what I want:

SELECT  *
    FROM  
    (
        SELECT  COUNT(*) AS calls, tmp.*
            FROM  
            (
                SELECT  b.id AS bid, u.id AS uid, b.rt_time,
                        b.clli, u.city, u.state
                    FROM  nrt.billing AS b
                    LEFT JOIN  users AS u  ON b.switchowner = u.id
                    ORDER BY  b.id DESC
                    LIMIT  18446744073709551615
            ) AS tmp
            GROUP BY  clli
            ORDER BY  bid DESC
            LIMIT  18446744073709551615
    ) AS tmp2
    GROUP BY  uid
    ORDER BY  bid DESC

(The LIMIT weirdness is a MariaDB hack to make it behave like MySQL.)

I initially had just:

SELECT  COUNT(*) AS calls, tmp.*
    FROM  
    (
        SELECT  b.id AS bid, u.id AS uid, b.rt_time, b.clli, u.city, u.state
            FROM  nrt.billing AS b
            LEFT JOIN  users AS u  ON b.switchowner = u.id
            ORDER BY  b.id DESC
            LIMIT  18446744073709551615
    ) AS tmp
    GROUP BY  clli
    ORDER BY  bid DESC` 

… and that worked mostly all right. What I am trying to do is get one record per clli column, the newest once for each (hence with the highest ID). This gives me the newest record for every clli value there is. I do like the single subquery way of doing it as opposed to inner joining on a table with max values.

A slight problem is that the clli values sometimes (but rarely) change. In that case, I don't want the clli values that have been superseded. There is "effectively" a foreign key to the users table here, so I can associate clli values with user IDs. Of course, GROUP BY clli, id will give me every unique clli/id pairing, giving me more rows, the opposite of what I want. So, I nested the subqueries and did another GROUP BY to get only the latest clli per user ID.

Is there a way to optimize out the second nested subquery surrounding the first one? It seems like there should also be a way to combine the GROUP BYs in the other direction.

At 60,000 rows, and the billing table being much larger than users, this doesn't perform badly, but it just seems like a bad query to me. Can the second subquery be optimized out, or is this as good as I'll get with this approach?

Here is table structure:

describe billing;
+---------------+------------------+------+-----+-------------------+----------------+
| Field         | Type             | Null | Key | Default           | Extra          |
+---------------+------------------+------+-----+-------------------+----------------+
| id            | int(11)          | NO   | PRI | NULL              | auto_increment |
| ani           | varchar(32)      | NO   | MUL | NULL              |                |
| called_number | int(7)           | NO   |     | NULL              |                |
| chargedfrom   | varchar(32)      | NO   |     | NULL              |                |
| switchowner   | int(11) unsigned | NO   |     | NULL              |                |
| clli          | varchar(24)      | NO   |     | NULL              |                              |
| rt_time       | datetime         | NO   |     | CURRENT_TIMESTAMP |                |
+---------------+------------------+------+-----+-------------------+----------------+

describe users;
+-----------------+-------------+------+-----+-------------------+----------------+
| Field           | Type        | Null | Key | Default           | Extra          |
+-----------------+-------------+------+-----+-------------------+----------------+
| id              | int(11)     | NO   | PRI | NULL              | auto_increment |               |
| city            | varchar(32) | NO   |     | NULL              |                |
| state           | varchar(2)  | NO   |     | NULL              |                
+-----------------+-------------+------+-----+-------------------+----------------+

I've omitted rows that are not relevant / used in these queries.

It isn't officially part of the structure, but billing.switchowner is used as a foreign key with users.id.

Here is a DB fiddle: https://www.db-fiddle.com/f/7y19T29CeykBSPFb7oJq9M/0

If that doesn't work, this will set up the fiddle:

CREATE TABLE `users` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `city` varchar(32) NOT NULL,
 `state` varchar(2) NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;
CREATE TABLE `billing` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `switchowner` int(11) unsigned NOT NULL,
 `clli` varchar(24) NOT NULL,
 `rt_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;
INSERT INTO `users` (city, state) VALUES ("New York", "NY");
INSERT INTO `users` (city, state) VALUES ("Wilmington", "DE");
INSERT INTO `users` (city, state) VALUES ("San Francisco", "CA");
INSERT INTO `users` (city, state) VALUES ("Phoenix", "AZ");
INSERT INTO `users` (city, state) VALUES ("Houston", "TX");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK-01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK-01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("4", "PHNXAZ01");
INSERT INTO `billing` (switchowner, clli) VALUES ("4", "PHNXAZ01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("5", "HTX01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("2", "WILMDE01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("3", "SANFRAN01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("2", "WILMDE01");
INSERT INTO `billing` (switchowner, clli) VALUES ("5", "HTX01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("3", "SANFRAN01");
INSERT INTO `billing` (switchowner, clli) VALUES ("4", "PHNXAZ01");
INSERT INTO `billing` (switchowner, clli) VALUES ("3", "SANFRAN01");
INSERT INTO `billing` (switchowner, clli) VALUES ("1", "NEWYRK01");
INSERT INTO `billing` (switchowner, clli) VALUES ("3", "SANFRAN01");
SELECT * FROM (SELECT COUNT(*) AS calls, tmp.* FROM (
SELECT b.id AS bid, u.id AS uid, b.rt_time, b.clli, u.city, u.state FROM billing
AS b LEFT JOIN users AS u ON b.switchowner = u.id
ORDER BY b.id DESC LIMIT 18446744073709551615)
AS tmp GROUP BY clli ORDER BY bid DESC LIMIT 18446744073709551615)
AS tmp2 GROUP BY uid ORDER BY bid DESC;```

Best Answer

Your query is not valid SQL, it is only allowed if ONLY_FULL_GROUP_BY is missing from @@sql_mode. I moved your example to Fiddle1 and tried. Note that I set

set @@sql_mode=concat(@@sql_mode, ',ONLY_FULL_GROUP_BY');

and as a consequence 'tmp.bid' isn't in GROUP BY is the error message you get from your query.

But, let's enable partial GROUP BY and see where that leads us. Your innermost query:

SELECT b.id AS bid, u.id AS uid, b.rt_time, b.clli, u.city, u.state 
FROM billing AS b 
LEFT JOIN users AS u 
    ON b.switchowner = u.id
ORDER BY b.id DESC LIMIT 18446744073709551615;

returns 23 rows. So far so good. If we add another layer of nesting:

SELECT tmp.*, COUNT(*) AS calls FROM (
    SELECT b.id AS bid, u.id AS uid, b.rt_time, b.clli, u.city, u.state 
    FROM billing AS b 
    LEFT JOIN users AS u 
        ON b.switchowner = u.id
    ORDER BY b.id DESC LIMIT 18446744073709551615
) AS tmp 
GROUP BY clli

This seems to work, but there is no rational explanation to why it should (AFAIK). Unless you find it clearly documented that it is guaranteed to work, I would not trust the result. Will it work when you upgrade to the next version?

Starting with 10.2 you can use window functions. They are a much better alternative if you can upgrade to that version:

SELECT bid, uid, rt_time, clli, city, state, calls
FROM (
    SELECT b.id AS bid, u.id AS uid, b.rt_time, b.clli, u.city, u.state
         , row_number() over (partition by b.clli order by b.id desc) as rn
         , count(1) over (partition by b.clli) as calls
    FROM billing AS b 
    LEFT JOIN users AS u 
        ON b.switchowner = u.id
) as tmp
WHERE rn = 1
ORDER BY clli, bid;

Fiddle for MariaDB 10.3