MySQL: How to perform aggregate functions on rows where a column is duplicate, for all values of said column

MySQL

I have a table where one of the columns may have duplicate entries, and I want to condense each set of duplicates into a single row, with an aggregate function applied to the other columns. I cannot think of a sensible explanation of this, so here is an example data table and result table

+------------------------+--------------------+
| column_with_duplicates | column_to_aggregate|
+------------------------+--------------------+
| 1                      | 1                  |
| 1                      | 2                  |
| 2                      | 3                  |
| 3                      | 4                  |
| 3                      | 5                  |
| 3                      | 6                  |
+---------------------------------------------+
output

+------------------------+--------------------+
| column_with_duplicates | column_to_aggregate|
+------------------------+--------------------+
| 1                      | 1.5                |
| 2                      | 2                  |
| 3                      | 5                  |
+---------------------------------------------+

All of the rows where column_with_duplicates is 1 had column_to_aggregate averaged together into a result row. The same happened for all values of column_with_duplicates

Since I had so much difficulty describing this, I couldn't effectively research the issue.

Best Answer

It seems like your query needs to be something like:

SELECT 
  column_with_duplicates
  , AVG(column_to_aggregate) AS column_to_aggregate -- AVG does the math, AS gives the name
FROM <table name> 
GROUP BY 
  column_with_duplicates;

More information on creating aliases (the AS keyword).

More information on GROUP BY

More information on the AVG function with a handy example that would have served you well.

Since you had a question about the GROUP BY clause in the comments below...

As I understand it MySQL has some quirks with it's GROUP BY rules. Those quirks not withstanding, I think of a simple GROUP BY clause, like the one in the query above, as sorting the result set into buckets. So in our query above there would be one bucket for each unique value in the column_with_duplicates column. So each value in the column_to_aggregate would get dropped into the proper bucket. Once all the rows/values are sorted into their buckets then an average is taken of all the values in each bucket individually.

Of course this is just a conceptual model, but it should give you a basic understanding of the role of a GROUP BY clause.

PROPOSED SOLUTION

SET @ndx = 0;
SET @cur_hash = '';
SELECT entry_id,CONCAT(url_title,IF(ndx>0,CONCAT('-',ndx),'')) url_title
FROM (SELECT
   @new_hash := MD5(url_title),
   (@ndx := IF(@new_hash=@cur_hash,@ndx+1,0)) ndx,
   @cur_hash := MD5(url_title),
   url_title,entry_id
FROM (SELECT url_title,entry_id FROM aaronlax ORDER BY url_title) AA) A;

SAMPLE DATA

use test
DROP TABLE IF EXISTS aaronlax;
CREATE TABLE aaronlax
(
    entry_id int(10) unsigned NOT NULL AUTO_INCREMENT,
    url_title varchar(75) NOT NULL,
    PRIMARY KEY (entry_id),
    KEY (url_title)    
) ENGINE=MyISAM;
INSERt INTO aaronlax (url_title) VALUES
('a-stoney'),('a-stoney'),
('ab-dille'),('ab-dille'),
('ab-dille'),('ac-gifta'),
('ae-jones'),('ae-jones');

SAMPLE DATA LOADED

mysql> use test
Database changed
mysql> DROP TABLE IF EXISTS aaronlax;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE aaronlax
    -> (
    ->     entry_id int(10) unsigned NOT NULL AUTO_INCREMENT,
    ->     url_title varchar(75) NOT NULL,
    ->     PRIMARY KEY (entry_id),
    ->     KEY (url_title)
    -> ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.02 sec)

mysql> INSERt INTO aaronlax (url_title) VALUES
    -> ('a-stoney'),('a-stoney'),
    -> ('ab-dille'),('ab-dille'),
    -> ('ab-dille'),('ac-gifta'),
    -> ('ae-jones'),('ae-jones');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> select * from aaronlax;
+----------+-----------+
| entry_id | url_title |
+----------+-----------+
|        1 | a-stoney  |
|        2 | a-stoney  |
|        3 | ab-dille  |
|        4 | ab-dille  |
|        5 | ab-dille  |
|        6 | ac-gifta  |
|        7 | ae-jones  |
|        8 | ae-jones  |
+----------+-----------+
8 rows in set (0.00 sec)

mysql>

PROPOSED SOLUTION EXECUTED

mysql> SET @ndx = 0;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @cur_hash = '';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT entry_id,CONCAT(url_title,IF(ndx>0,CONCAT('-',ndx),'')) url_title
    -> FROM (SELECT
    ->    @new_hash := MD5(url_title),
    ->    (@ndx := IF(@new_hash=@cur_hash,@ndx+1,0)) ndx,
    ->    @cur_hash := MD5(url_title),
    ->    url_title,entry_id
    -> FROM (SELECT url_title,entry_id FROM aaronlax ORDER BY url_title) AA) A;
+----------+------------+
| entry_id | url_title  |
+----------+------------+
|        1 | a-stoney   |
|        2 | a-stoney-1 |
|        3 | ab-dille   |
|        4 | ab-dille-1 |
|        5 | ab-dille-2 |
|        6 | ac-gifta   |
|        7 | ae-jones   |
|        8 | ae-jones-1 |
+----------+------------+
8 rows in set (0.00 sec)

mysql>

GIVE IT A TRY !!!

CAVEAT

In case you are wondering why the proposed solution works, let's take a look at the subqueries

INNERMOST SUBQUERY

mysql> SELECT url_title,entry_id FROM aaronlax ORDER BY url_title;
+-----------+----------+
| url_title | entry_id |
+-----------+----------+
| a-stoney  |        1 |
| a-stoney  |        2 |
| ab-dille  |        3 |
| ab-dille  |        4 |
| ab-dille  |        5 |
| ac-gifta  |        6 |
| ae-jones  |        7 |
| ae-jones  |        8 |
+-----------+----------+
8 rows in set (0.00 sec)

mysql>

NEXT LEVEL SUBQUERY

mysql> SELECT
    ->    @new_hash := MD5(url_title),
    ->    (@ndx := IF(@new_hash=@cur_hash,@ndx+1,0)) ndx,
    ->    @cur_hash := MD5(url_title),
    ->    url_title,entry_id
    -> FROM (SELECT url_title,entry_id FROM aaronlax ORDER BY url_title) AA;
+----------------------------------+------+----------------------------------+-----------+----------+
| @new_hash := MD5(url_title)      | ndx  | @cur_hash := MD5(url_title)      | url_title | entry_id |
+----------------------------------+------+----------------------------------+-----------+----------+
| 911adc6db360f218da5069b20cfb1917 |    0 | 911adc6db360f218da5069b20cfb1917 | a-stoney  |        1 |
| 911adc6db360f218da5069b20cfb1917 |    1 | 911adc6db360f218da5069b20cfb1917 | a-stoney  |        2 |
| b468c0dd0ef78fde7cf5ae14765f9a5a |    0 | b468c0dd0ef78fde7cf5ae14765f9a5a | ab-dille  |        3 |
| b468c0dd0ef78fde7cf5ae14765f9a5a |    1 | b468c0dd0ef78fde7cf5ae14765f9a5a | ab-dille  |        4 |
| b468c0dd0ef78fde7cf5ae14765f9a5a |    2 | b468c0dd0ef78fde7cf5ae14765f9a5a | ab-dille  |        5 |
| 3fff0367af8bf5d3b3b0089b8f187c04 |    0 | 3fff0367af8bf5d3b3b0089b8f187c04 | ac-gifta  |        6 |
| 1909ba0f7a57ca081c82d2f7ba54c9e5 |    0 | 1909ba0f7a57ca081c82d2f7ba54c9e5 | ae-jones  |        7 |
| 1909ba0f7a57ca081c82d2f7ba54c9e5 |    1 | 1909ba0f7a57ca081c82d2f7ba54c9e5 | ae-jones  |        8 |
+----------------------------------+------+----------------------------------+-----------+----------+
8 rows in set (0.00 sec)

mysql>

From this level, I only need three columns: entry_id, url_title, ndx in the outer most level

Then, I append to the url_title a hyphen and ndx if ndx is nonzero

Mysql – Is a 5+ column primary key bad for large (100 million+) table

There are performance issues with very complex primary keys. And It may not be defending against duplication as well as a simpler primary key might.

However, there is one design pattern that frequently yields tables with a primary key made up of six or so components. It's star schema fact tables. If the fact table of a star schema has six dimensions, then the primary key will have six components. I've never seen a fact table with no declared primary key, and I think it's well worth the overhead, even though the ETL process still has to be quite carefully written.

Some reporting databases imitate the pattern of star schema even if it's not explicitly designed that way.

100 million + rows is not overly big for a fact table, especially with today's big data.