I'll base this question on a toy example.
Let this be table A
:
A
-------------------------
U | V | W | X | Y | Z
-------------------------
a | b | c | 1 | 6 | 8.3
a | b | c | 1 | 4 | 3.7
a | b | f | 3 | 4 | 2.6
a | b | f | 3 | 2 | 6.0
a | e | c | 1 | 0 | 3.5
a | e | c | 1 | 5 | 8.8
d | b | f | 1 | 0 | 3.5
d | b | f | 1 | 3 | 2.3
d | e | c | 2 | 6 | 2.2
d | e | c | 2 | 4 | 3.3
d | e | f | 0 | 7 | 5.0
d | e | f | 0 | 6 | 3.6
I can produce a second table B
by grouping the rows of A
by columns U
, V
, and W
, and computing the average of column Z for each group.
B
-------------------
U | V | W | Z_avg
-------------------
a | b | c | 6.0
a | b | f | 4.3
a | e | c | 6.2
d | b | f | 2.9
d | e | c | 2.7
d | e | f | 4.3
The SQL for this would be something like
SELECT U, V, W, AVG(Z) AS Z_avg FROM A GROUP BY U, V, W;
But I want the new table to include all the columns of the original table that have a functional dependence on the grouping columns U
, V
, and W
. In this example there is one such column, namely column X
.
In other words, I want to generate the table C
shown below:
C
-----------------------
U | V | W | X | Z_avg
-----------------------
a | b | c | 1 | 6.0
a | b | f | 3 | 4.3
a | e | c | 1 | 6.2
d | b | f | 1 | 2.9
d | e | c | 2 | 2.7
d | e | f | 0 | 4.3
So this problem has two parts, at least conceptually.
-
How to determine which columns are functionally dependent on
columnsU
,V
, andW
? -
What is the SQL to generate table
C
?
I know how to implement a (say, Python) script that can answer (1), but it is tedious and slow. (Basically, for each of the candidate columns, in this case X
and Y
, the script would collect all of its values for each distinct combination of values in columns U
, V
, and Z
, and then, if each of these sets of values has exactly one element, then the column is functi\onally related to U
, V
, and Z
.)
Likewise, once I have identfied the functionally dependent columns, I can muddle may way through (using temporary tables and what not) to eventually end up with something like table C
above (thus, effectively solving (2)).
I figure, however, that this task is sufficiently common that there may be standard tools/techniques to carry it out.
Best Answer
Under the assumptions that you have some columns (say
x
andy
in your example), that you don't know if they are functionally equivalent or not - and that these columns do not have anyNULL
values (which would complicate things), you can use:If a column is not functionally dependent (with the current data), it will have at least 2 distinct values, so the
MIN
andMAX
will be different and the result will be shown asNULL
in that column, in all rows. You can then remove it from the query.