MySQL – Is it possible to reduce a group to a row based on some criteria

aggregate, group by, MySQL

I'd like to perform a select on a table involving a GROUP BY such that all rows that share the same set of identifiers are grouped together, but I want to reduce the group to one of the grouped rows based on some criterion, for example the maximum date_added. However, there are other fields of data that could differ among the grouped rows. I want all of those columns to resolve to the row with the max date_added as well.

I realize to get the max date_added I could simply SELECT MAX(date_added), but that is just a column-level aggregate function. Is there any way I can resolve the entire row in a group?

Conceptually, if you imagine each group as a separate table, I want to SELECT * WHERE date_added=(SELECT MAX(date_added)) from that group table.

Best Answer

The traditional solution, the one you may find in books, is a self join: first find the max date per group, then join that result back to the same table to pick up the rows carrying that max date.
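As a minimal sketch of that self join, assuming the same placeholder names the query below uses (a table t grouped by whatever, with a date_added column); the alias latest and the column name max_date_added are made up for illustration:

```sql
-- Sketch only: t, whatever and date_added are placeholder names.
SELECT t.*
FROM t
JOIN (
  -- one row per group: the group key and its latest date
  SELECT whatever, MAX(date_added) AS max_date_added
  FROM t
  GROUP BY whatever
) AS latest
  ON  latest.whatever = t.whatever
  AND latest.max_date_added = t.date_added;
```

Note that if two rows of a group share the same maximum date_added, this returns both of them.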

However, some hacks allow you to avoid that. Consider the following query:

SELECT
  MAX(date_added) AS date_added,
  SUBSTRING_INDEX(GROUP_CONCAT(some_column ORDER BY date_added DESC), ',', 1) AS some_column,
  SUBSTRING_INDEX(GROUP_CONCAT(another_column ORDER BY date_added DESC), ',', 1) AS another_column
FROM t
GROUP BY whatever

GROUP_CONCAT is an aggregate function that concatenates a group's values into one string. It accepts an ORDER BY, which we use via ORDER BY date_added DESC so that the desired value comes first. We then extract that first token from the string with SUBSTRING_INDEX.

The downside here (apart from the query's rather frightening appearance) is that your numerical values are converted to strings. That is typically no big deal in SQL, but be aware of it.

There's another option where you do a semi-self-join, a much lighter one, though you will have to give up use of an index. It's quite long to describe; it still uses GROUP_CONCAT and SUBSTRING_INDEX, but only to create a derived table holding just the relevant keys. This derived table is then joined with the original table. See an example in "SQL: selecting top N records per group, another solution".
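A rough sketch of what such a query might look like, not the linked article's exact code. It assumes t has an integer primary key named id, which the answer never names, and the alias latest is made up for illustration:

```sql
-- Sketch only: t.id is an assumed integer primary key.
SELECT t.*
FROM (
  -- per group, keep only the id of the row with the latest date_added
  SELECT CAST(
           SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY date_added DESC), ',', 1)
           AS UNSIGNED
         ) AS id
  FROM t
  GROUP BY whatever
) AS latest
JOIN t ON t.id = latest.id;
```

Only the key goes through the string conversion here, and the CAST turns it back into a number before the join; every other column comes from the original row with its original type.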

Schema

The translated schema could look like this:

CREATE TABLE log (
  id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);

I moved dst_port int to the 2nd position to optimize alignment / padding:

Configuring PostgreSQL for read performance

Now we can use standard window functions (not available in MySQL before version 8.0).

Step 1: Fold groups of consecutive `dst_ip` for same (`dst_port`)

One special difficulty: the aggregate functions min() / max() are not implemented for inet in Postgres 9.4. Both were added in Postgres 9.5.

So I substituted with DISTINCT ON in the first step:

Select first row in each GROUP BY group?

SELECT DISTINCT ON (dst_port, ip_grp)
       dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM  (
   SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                        ORDER BY dst_ip) AS ip_grp
   FROM   log
   ORDER  BY dst_port, dst_ip
   ) sub
ORDER  BY dst_port, ip_grp, dst_ip;

The result is as desired, with a count of rows per group (it could return the upper IP as well).

You can subtract an integer from (or add one to) the inet type. By subtracting row_number(), all consecutive rows get the same ip_grp. The value of ip_grp itself is irrelevant; what matters is the fact that it is the same for every row of a run within a partition (dst_port).
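The subtraction trick is easier to see on plain integers. A small illustration on made-up values (not data from the question), where 1, 2, 3 form one run and 7, 8 another:

```sql
-- Toy values only; v, n, rn and grp are names made up for this illustration.
SELECT n,
       row_number() OVER (ORDER BY n)     AS rn,
       n - row_number() OVER (ORDER BY n) AS grp
FROM  (VALUES (1), (2), (3), (7), (8)) AS v(n)
ORDER  BY n;
--  n | rn | grp
--  1 |  1 |   0
--  2 |  2 |   0
--  3 |  3 |   0
--  7 |  4 |   3
--  8 |  5 |   3
```

Within a run, n and rn both grow by one per row, so their difference stays constant; a gap makes n jump ahead of rn and starts a new grp. This relies on the values being distinct within a partition, since a duplicate would advance rn without advancing n and split the run.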

Then we can GROUP BY ... - or in this special case DISTINCT ON (dst_port, ip_grp). I use another window function to get the count ip_ct in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct.

Note that consecutive IPs can cross byte boundaries (see my comment to question).

Detailed explanation for this technique:

Select longest continuous sequence

Step 2: Fold groups of consecutive `dst_port` for same `(dst_ip, ip_ct)`

SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM  (
   SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
                                           ORDER BY dst_port) AS port_grp
   FROM  (
      SELECT DISTINCT ON (dst_port, ip_grp)
             dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
      FROM  (
         SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                              ORDER BY dst_ip) AS ip_grp
         FROM   log
         ORDER  BY dst_port, dst_ip
         ) sub1
      ORDER  BY dst_port, ip_grp, dst_ip
      ) sub2
   ) sub3
GROUP  BY 1, 2, port_grp
ORDER  BY 1, 3, 2;

Basically, repeat the logic of the first step, applied to the result of the first step.
But now you additionally have to group on ip_ct. And this time you can use the simpler min(dst_port), since the port number is a plain integer.

SQL Fiddle demonstrating all.
