I tested the performance of all 3 methods, and here's what I found:
- 1 record: No noticeable difference
- 10 records: No noticeable difference
- 1,000 records: No noticeable difference
- 10,000 records: The UNION subquery was a little slower; the CASE WHEN query was a little faster than the UNPIVOT one.
- 100,000 records: The UNION subquery was significantly slower, but the UNPIVOT query became a little faster than the CASE WHEN query.
- 500,000 records: The UNION subquery was still significantly slower, but the UNPIVOT query became much faster than the CASE WHEN query.
So the end result seems to be:
- With smaller record sets there doesn't seem to be enough of a difference to matter. Use whatever is easiest to read and maintain.
- Once you start getting into larger record sets, the UNION ALL subquery begins to perform poorly compared to the other two methods.
- The CASE statement performs the best up until a certain point (in my case, around 100k rows), at which point the UNPIVOT query becomes the best-performing query.
- The actual number at which one query becomes better than another will probably change as a result of your hardware, database schema, data, and current server load, so be sure to test with your own system if you're concerned about performance.
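For reference, here is roughly what the three query shapes look like. The table and column names (Scores with Id, Val1, Val2, Val3) are placeholders rather than the schema I actually tested against, so treat these as sketches:

-- Hypothetical table Scores(Id, Val1, Val2, Val3); goal: smallest of the three values per row

-- UNION ALL subquery
SELECT s.Id,
       (SELECT MIN(x.v)
        FROM (SELECT s.Val1 AS v
              UNION ALL SELECT s.Val2
              UNION ALL SELECT s.Val3) AS x) AS MinVal
FROM Scores AS s;

-- CASE WHEN
SELECT Id,
       CASE WHEN Val1 <= Val2 AND Val1 <= Val3 THEN Val1
            WHEN Val2 <= Val3                  THEN Val2
            ELSE Val3
       END AS MinVal
FROM Scores;

-- UNPIVOT
SELECT Id, MIN(v) AS MinVal
FROM Scores
UNPIVOT (v FOR col IN (Val1, Val2, Val3)) AS u
GROUP BY Id;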
I also ran some tests using Mikael's answer; however, it was slower than the three methods above for most recordset sizes. The only exception was that it did better than the UNION ALL query at very large recordset sizes. I do like that it shows the column name in addition to the smallest value, though.
I'm not a DBA, so I may not have optimized my tests and missed something. I was testing with actual live data, so that may have affected the results. I tried to account for that by running each query a few different times, but you never know. I would definitely be interested if someone wrote up a clean test of this and shared their results.
You can have window functions on the result of aggregate functions in a single query level.
This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principle. The calculations involved are not linear, so you cannot simply combine standard deviations of sub-populations.
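To sketch why (this is just the standard identity for the variance of a union of groups, nothing specific to this schema): for sub-groups with sizes n_i, means mu_i and population variances sigma_i^2, the combined values are

\mu_{combo} = \frac{\sum_i n_i \mu_i}{\sum_i n_i}, \qquad \sigma^2_{combo} = \frac{\sum_i n_i \left(\sigma_i^2 + \mu_i^2\right)}{\sum_i n_i} - \mu_{combo}^2

so the combined standard deviation depends on the group means as well, and no weighted sum of the per-group stddev_pop values alone can reproduce it.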
SELECT perm
      ,combo
      ,avg(value) AS perm_average_value
      ,sum(avg(value) * count(*)) OVER w_combo
     / sum(count(*))              OVER w_combo AS combo_average_value
      ,stddev_pop(value) AS perm_stddev
      ,0                 AS combo_stddev  -- doesn't work!
      ,count(*)          AS perm_count
      ,sum(count(*)) OVER w_combo AS combo_count
FROM   foo
GROUP  BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);
For combo_average_value you would need this expression:
sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo
since you need a weighted average: the average of a group with 10 members weighs more than the average of a group with just 2 members. (For instance, a group of 10 rows averaging 4 and a group of 2 rows averaging 10 combine to (10 * 4 + 2 * 10) / 12 = 5, not the plain (4 + 10) / 2 = 7.)
This works:
SELECT DISTINCT ON (perm, combo)
       perm
      ,combo
      ,avg(value)        OVER wpc AS perm_average_value
      ,avg(value)        OVER wc  AS combo_average_value
      ,stddev_pop(value) OVER wpc AS perm_stddev
      ,stddev_pop(value) OVER wc  AS combo_stddev
      ,count(*)          OVER wpc AS perm_count
      ,count(*)          OVER wc  AS combo_count
FROM   foo
WINDOW wc  AS (PARTITION BY combo)
      ,wpc AS (PARTITION BY perm, combo);
I am using two different windows here and reduce the rows with DISTINCT ON, which is applied after window functions are evaluated.
But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
CREATE TABLE combo (
   combo_id serial PRIMARY KEY
  ,combo    int[] NOT NULL
);

CREATE TABLE perm (
   perm_id serial PRIMARY KEY
  ,perm    int[] NOT NULL
);

CREATE TABLE value (
   perm_id  int REFERENCES perm(perm_id)
  ,combo_id int REFERENCES combo(combo_id)
  ,value    numeric NOT NULL DEFAULT 0
);
If you don't need referential integrity you can omit the foreign key constraints.
The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.
This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int: 8 bytes, no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)
More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
You would first aggregate in a subquery and then join to perm and combo for best performance.
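A sketch of what that could look like against the normalized layout above (the output columns are illustrative, and the combo-level figures could then be layered on top with the same window technique as before):

SELECT p.perm
      ,c.combo
      ,v.perm_average_value
      ,v.perm_stddev
      ,v.perm_count
FROM  (
   SELECT perm_id
         ,combo_id
         ,avg(value)        AS perm_average_value
         ,stddev_pop(value) AS perm_stddev
         ,count(*)          AS perm_count
   FROM   value
   GROUP  BY perm_id, combo_id
   ) v
JOIN   perm  p USING (perm_id)
JOIN   combo c USING (combo_id);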
Best Answer
If the columns aren't nullable then simply using
will work fine (though on some RDBMSs you might need to have a non-integer numerator or divisor to avoid integer division).
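A minimal sketch of that, assuming the goal is a per-row average of three hypothetical columns c1, c2 and c3 in a table t:

-- Hypothetical table t(id, c1, c2, c3), columns NOT NULL
SELECT id
      ,(c1 + c2 + c3) / 3.0 AS avg_value  -- 3.0 rather than 3 avoids integer division
FROM   t;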
For nullable columns you might want to use something like
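For example, a hedged sketch with the same hypothetical columns, averaging only the non-NULL values and returning NULL when all three are NULL:

-- Hypothetical table t(id, c1, c2, c3), columns nullable
SELECT id
      ,(COALESCE(c1, 0) + COALESCE(c2, 0) + COALESCE(c3, 0)) * 1.0
     / NULLIF(  CASE WHEN c1 IS NOT NULL THEN 1 ELSE 0 END
              + CASE WHEN c2 IS NOT NULL THEN 1 ELSE 0 END
              + CASE WHEN c3 IS NOT NULL THEN 1 ELSE 0 END, 0) AS avg_value  -- NULLIF avoids division by zero
FROM   t;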
On SQL Server you could also use
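One option that fits here (a sketch with placeholder names, not necessarily the exact construct meant above) is CROSS APPLY with a VALUES row constructor, which lets AVG skip the NULLs:

SELECT t.id
      ,a.avg_value
FROM   t
CROSS APPLY (SELECT AVG(v * 1.0) AS avg_value  -- AVG ignores NULLs; * 1.0 avoids integer division
             FROM (VALUES (t.c1), (t.c2), (t.c3)) AS x(v)) AS a;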