I have a table like this:
CREATE TABLE `files` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`hash` char(40) DEFAULT NULL,
`size` bigint(20) unsigned DEFAULT '0',
PRIMARY KEY (`id`),
KEY `ndx_hash` (`hash`),
KEY `hash_and_size_ndx` (`hash`,`size`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
This table has files that contains same (hash
, size
) with different name
in it.
It should be noted that two file records containing the same hash
will have the same size
.
What I'm looking for is the total size of unique files in this table. Uniqueness is based on hash
column.
Here is a query that provides what I want:
SELECT sum(t.unique_size) FROM (SELECT MIN(size) as unique_size FROM files GROUP BY hash) AS t;
Output is:
+--------------------+
| sum(t.unique_size) |
+--------------------+
| 64233241 |
+--------------------+
The reason I don't like this query is because its slow. Its slow because it creates a temporary table. My table has more than 5m records in it.
I think it is a waste to create a giant temporary table since it is theoretically possible to allocate one integer field and increment it as the unique_size
's (see my query) are generated.
Is there a way to get what I want without creating a temporary table?
Note: I've experimented WITH ROLLUP
and found no use of it.
Best Answer
I would suggest normalizing data and splitting the table into the following two tables:
This eliminates the possibility of inconsistency between
hash
andsize
fields, and the total size of all (unique) files can be obtained with the query:If you need the original (denormalized) structure, you can perform a simple
join
: