What you are asking is a little daunting. Here is why:
Would it be faster to store a hash of the value as well and instead index and search on that?
Creating a hash column and indexing it sounds like a great idea. I suggested it back on March 03, 2013: Possible INDEX on a VARCHAR field in MySql (see Suggestion #3)
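As a minimal sketch of that suggestion (table name `mydata` and column names are hypothetical; MD5 is just one fixed-length hash you could use):

```sql
-- Add a fixed-length hash column and index it instead of the long VARCHAR
ALTER TABLE mydata
    ADD COLUMN val_hash CHAR(32) NOT NULL DEFAULT '',  -- MD5 yields a fixed 32-hex-char digest
    ADD INDEX idx_val_hash (val_hash);

-- Backfill the hash for existing rows
UPDATE mydata SET val_hash = MD5(val);
```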
Does that even make sense if the values are not guaranteed to be unique?
This would depend on the cardinality of that hash column. Since you said you will have millions of rows, let me express this in numerical terms:
Run SELECT COUNT(DISTINCT hashcolumn) ...
against the table. For a one-million-row table, this count should be greater than 20. In other words, each distinct value should have no more than 50,000 rows (5% of the table rows). Any value that has more than 50,000 rows will cause the MySQL Query Optimizer to dismiss the index from being used and make a full table scan the preferred method for that hash value.
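You can spot the offending values directly. A sketch, assuming the hypothetical `mydata`/`val_hash` names and the 5%-of-1M threshold above:

```sql
-- List hash values whose row count exceeds 5% of a one-million-row table;
-- any value returned here would push the optimizer toward a full table scan
SELECT val_hash, COUNT(*) AS rows_per_hash
FROM mydata
GROUP BY val_hash
HAVING COUNT(*) > 50000
ORDER BY rows_per_hash DESC;
```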
If hashing the values gives them a consistent length, would indexing that make queries faster?
I would say Yes and Perhaps at the same time. Why two answers? Indexing and using a hash column in place of a long column sounds brilliant against a MyISAM table, but you said you are using InnoDB.
When it comes to fixed vs variable-length text, I would go with MyISAM over InnoDB.
EPILOGUE
If the table is fairly-to-heavily used in transactions, the table must stay InnoDB; otherwise, you can take better advantage of your idea in MyISAM. Either way, you can go forward with the hash idea. Please make sure the PRIMARY KEY is a single integer column (BIGINT if you know you will exceed 2 billion rows; otherwise, INT). I would also do a major RAM upgrade and increase the InnoDB Buffer Pool size accordingly.
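Putting those recommendations together, a table definition might look like this (names and sizes are illustrative only):

```sql
-- Single-integer surrogate PRIMARY KEY, with the hash as a secondary index.
-- Use BIGINT UNSIGNED instead of INT if you expect more than ~2 billion rows.
CREATE TABLE mydata (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    val      VARCHAR(255) NOT NULL,
    val_hash CHAR(32)     NOT NULL,   -- fixed-length MD5 of val
    PRIMARY KEY (id),
    KEY idx_val_hash (val_hash)
) ENGINE=InnoDB;
```

Keeping the PRIMARY KEY to a single integer keeps every secondary index entry small, since InnoDB stores a copy of the primary key in each secondary index entry.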
This is just a guess, as I do not have all info, but you probably would be better by doing:
EXPLAIN SELECT STRAIGHT_JOIN
*
FROM
tusers PARTITION (p362) tu
JOIN users PARTITION (p362) u
ON u.group_id=tu.group_id
AND tu.email_address=u.email
AND tu.group_id = 362
WHERE
tu.application_id=253555;
Note the STRAIGHT_JOIN, which may not be needed (if it is needed, then I may have assumed wrongly), and the tu.group_id comparison (which, again, should not be needed).
Then using the following keys:
(tu.application_id, tu.group_id, tu.email_address)
(u.group_id, u.email)
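Those keys could be created like this (index names are my own; adjust if the tables already have overlapping indexes):

```sql
-- Composite index covering the WHERE filter and both join columns on tusers
ALTER TABLE tusers ADD INDEX idx_app_grp_email (application_id, group_id, email_address);

-- Composite index covering the join columns on users
ALTER TABLE users  ADD INDEX idx_grp_email (group_id, email);
```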
However, if the number of records to be returned is 2.5M, as your cardinality suggests, then do not expect this to be fast... this is pure I/O math.
There are many other things that strike me as potential problems, but I cannot say for sure without access to the system.
Those keys could be even more effective if you did not do a SELECT *.
Another thing is that VARCHAR(255) as a default column size is usually a bad idea; declare the length the data actually needs.
Best Answer
I have discussed having large PRIMARY KEYs for InnoDB before: What storage engine should I use for this MySQL table?. The effect would be bloated keys with linear growth of all secondary indexes, because the PRIMARY KEY reference carried by every Secondary Key Index entry would start getting bloated as well.
Looking at your question, you yourself are saying you will be expanding an email address from 20 to 64 characters. You will be bloating every non-unique index on the table in question that has an email address column. If the email address is itself the PRIMARY KEY, then every index on the table bloats, whether or not it contains an email address column. There are other viewpoints discussed in Mysql int vs varchar as primary key (InnoDB Storage Engine)?
It is bad enough that a CHAR field is faster to read than a VARCHAR field but at the expense of larger indexes (see my post What is the performance impact of using CHAR vs VARCHAR on a fixed-size field?). Doing this will certainly introduce configuration challenges.
Even if the email address is not indexed, the data pages of that InnoDB table will still experience bloating and possibly fragmentation nonetheless. Since you are hashing, I can presume you must be indexing it.
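If you do index the hash, remember that hashes can collide, so always re-check the full column in the lookup. A sketch, with hypothetical names:

```sql
-- Probe by the short, fixed-length indexed hash, then re-check the full
-- email column to eliminate hash collisions
SELECT *
FROM mydata
WHERE email_hash = MD5('user@example.com')
  AND email = 'user@example.com';
```

The index on `email_hash` narrows the search to a handful of rows; the equality test on `email` then guarantees correctness even when two addresses share a hash.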