Mysql – Summing a column based on uniqueness of another column without using temporary table

MySQLperformancequery-performancesum

I have a table like this:

CREATE TABLE `files` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `hash` char(40) DEFAULT NULL,
  `size` bigint(20) unsigned DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `ndx_hash` (`hash`),
  KEY `hash_and_size_ndx` (`hash`,`size`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8

This table has files that contains same (hash, size) with different name in it.

It should be noted that two file records containing the same hash will have the same size.

What I'm looking for is the total size of unique files in this table. Uniqueness is based on hash column.

Here is a query that provides what I want:

SELECT sum(t.unique_size) FROM (SELECT MIN(size) as unique_size FROM files GROUP BY hash) AS t;

Output is:

+--------------------+
| sum(t.unique_size) |
+--------------------+
|           64233241 |
+--------------------+

The reason I don't like this query is because its slow. Its slow because it creates a temporary table. My table has more than 5m records in it.

I think it is a waste to create a giant temporary table since it is theoretically possible to allocate one integer field and increment it as the unique_size's (see my query) are generated.

Is there a way to get what I want without creating a temporary table?

Note: I've experimented WITH ROLLUP and found no use of it.

Best Answer

I would suggest normalizing data and splitting the table into the following two tables:

CREATE TABLE `file_data` (
`data_id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` char(40) DEFAULT NULL,
`size` bigint(20) unsigned DEFAULT '0',
PRIMARY KEY (`data_id`),
UNIQUE (`hash`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

CREATE TABLE `file_names` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`data_id` bigint(20) NOT NULL,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (data_id) REFERENCES file_data(data_id)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

This eliminates the possibility of inconsistency between hash and size fields, and the total size of all (unique) files can be obtained with the query:

select sum(size) from file_data; 

If you need the original (denormalized) structure, you can perform a simple join:

select id, name, hash, size 
from file_names n join file_data d on d.data_id = n.data_id;