Include All Columns in SELECT GROUP BY Result – How to Guide

aggregatedependenciesgroup byprimary-key

I'll base this question on a toy example.

Let this be table A:

            A
-------------------------
 U | V | W | X | Y |  Z
-------------------------
 a | b | c | 1 | 6 | 8.3
 a | b | c | 1 | 4 | 3.7
 a | b | f | 3 | 4 | 2.6
 a | b | f | 3 | 2 | 6.0
 a | e | c | 1 | 0 | 3.5
 a | e | c | 1 | 5 | 8.8
 d | b | f | 1 | 0 | 3.5
 d | b | f | 1 | 3 | 2.3
 d | e | c | 2 | 6 | 2.2
 d | e | c | 2 | 4 | 3.3
 d | e | f | 0 | 7 | 5.0
 d | e | f | 0 | 6 | 3.6

I can produce a second table B by grouping the rows of A by columns U, V, and W, and computing the average of column Z for each group.

         B
-------------------
 U | V | W | Z_avg
-------------------
 a | b | c |  6.0
 a | b | f |  4.3
 a | e | c |  6.2
 d | b | f |  2.9
 d | e | c |  2.7
 d | e | f |  4.3

The SQL for this would be something like

SELECT U, V, W, AVG(Z) AS Z_avg FROM A GROUP BY U, V, W;

But I want the new table to include all the columns of the original table that have a functional dependence on the grouping columns U, V, and W. In this example there is one such column, namely column X.

In other words, I want to generate the table C shown below:

           C
-----------------------
 U | V | W | X | Z_avg
-----------------------
 a | b | c | 1 |  6.0
 a | b | f | 3 |  4.3
 a | e | c | 1 |  6.2
 d | b | f | 1 |  2.9
 d | e | c | 2 |  2.7
 d | e | f | 0 |  4.3

So this problem has two parts, at least conceptually.

  1. How to determine which columns are functionally dependent on
    columns U, V, and W?

  2. What is the SQL to generate table C?

I know how to implement a (say, Python) script that can answer (1), but it is tedious and slow. (Basically, for each of the candidate columns, in this case X and Y, the script would collect all of its values for each distinct combination of values in columns U, V, and Z, and then, if each of these sets of values has exactly one element, then the column is functi\onally related to U, V, and Z.)

Likewise, once I have identfied the functionally dependent columns, I can muddle may way through (using temporary tables and what not) to eventually end up with something like table C above (thus, effectively solving (2)).

I figure, however, that this task is sufficiently common that there may be standard tools/techniques to carry it out.

Best Answer

Under the assumptions that you have some columns (say x and y in your example), that you don't know if they are functionally equivalent or not - and that these columns do not have any NULL values (which would complicate things), you can use:

SELECT 
    u, v, w,                    -- the grouping columns

    AVG(z) AS z_avg,            -- the non-functionally dependent

    CASE WHEN MIN(x) = MAX(x)   -- a possibly functionally dependent
      THEN MIN(x) ELSE NULL     -- column
    END AS x,

    CASE WHEN MIN(y) = MAX(y)   -- another one
      THEN MIN(y) ELSE NULL
    END AS y
FROM 
    a
GROUP BY 
    u, v, w ;

If a column is not functionally dependent (with the current data), it will have at least 2 distinct values, so the MIN and MAX will be different and the result will be shown as NULL in that column, in all rows. You can then remove it from the query.