You do not want a gigantic gen_clust_index (InnoDB's internal clustered index). That size is ungodly huge even for a secondary index.
You may have to resort to triggers or stored procedures to check for the key well in advance.
You could also think about applying the SHA1() function to the VARCHAR(3071) field. SHA1() returns a 40-character hex string, and this hash may be just what you need to index.
Suppose you have this:
CREATE TABLE mytable
(
id int not null auto_increment,
txt VARCHAR(3071),
primary key (id)
) ENGINE=InnoDB;
and you want to make a UNIQUE index on txt. Try the SHA1 approach:
CREATE TABLE mytablenew LIKE mytable;
ALTER TABLE mytablenew ADD txtsha1 CHAR(40);
ALTER TABLE mytablenew ADD UNIQUE KEY (txtsha1);
INSERT INTO mytablenew (id,txt,txtsha1)
SELECT id,txt,SHA1(txt) FROM mytable;
Then, count them:
SELECT COUNT(1) FROM mytable;
SELECT COUNT(1) FROM mytablenew;
If the counts are the same, CONGRATULATIONS!!! Now you have a unique index of length 40. You can finish up with:
ALTER TABLE mytable RENAME mytableold;
ALTER TABLE mytablenew RENAME mytable;
DROP TABLE mytableold;
This could be done more atomically, as pointed out in the comments below:
RENAME TABLE mytable TO mytableold, mytablenew TO mytable;
DROP TABLE mytableold;
Perform this on whatever table you intend to have this big column in. You have to remember to add the SHA1 of the data along with the data upon INSERT.
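For example, using the rebuilt table above (the value is made up):

SET @txt = 'http://example.com/some/long/value';
INSERT INTO mytable (txt, txtsha1) VALUES (@txt, SHA1(@txt));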
The odds of a duplicate hash are 1 in 2^160, which is 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 (about 1.4615 × 10^48).
Give it a try!!!
Several ideas come to mind:
The hash table could be a MEMORY table; once your inserts are done, you could then copy just the domains that don't already exist into the permanent domain_list table. Having your new data in memory might also make the domain-name comparison between the hash table and the domain_list table go faster.
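A sketch of that flow (domain_staging is a made-up name, and I'm assuming domain_list has a domain column):

CREATE TABLE domain_staging
(
domain VARCHAR(255) NOT NULL,
PRIMARY KEY (domain)
) ENGINE=MEMORY;

-- bulk-load the new batch into domain_staging, then copy only unseen domains:
INSERT INTO domain_list (domain)
SELECT s.domain
FROM domain_staging s
LEFT JOIN domain_list d ON d.domain = s.domain
WHERE d.domain IS NULL;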
Another option would be to use the innodb_file_per_table option and keep the per-table tablespace (.ibd) files on a faster disk.
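For example, in my.cnf (the .ibd files, or symlinks to them, can then live on the faster disk):

[mysqld]
innodb_file_per_table = 1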
A third option would be to load your entire dataset into a MEMORY table and then convert that table to InnoDB using ALTER TABLE table_name ENGINE=InnoDB. This will be much faster than inserting into InnoDB directly. This is possible since you control how and when data is inserted.
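A sketch using the earlier sample table (mytable_mem is a made-up name; note that MEMORY tables can't hold TEXT or BLOB columns, though VARCHAR is fine):

CREATE TABLE mytable_mem LIKE mytable;
ALTER TABLE mytable_mem ENGINE=MEMORY;
-- ... bulk-load into mytable_mem here ...
ALTER TABLE mytable_mem ENGINE=InnoDB;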
Best Answer
We keep giving you answers that do not directly answer the question, because that is how we solve this problem: an index of unlimited length is impractical and inefficient, but a unique hash is sufficient to the task because of the astronomically low likelihood of a meaningful collision.
Similar to the other offered solutions, my standard approach does not check for duplicates up front -- it is optimistic in that sense: it relies on constraint checking by the database, with the assumption that most inserts are not duplicates, so there's no point in wasting time trying to determine if they are.
Working, tested example (5.7.16, backwards compatible to 5.6; previous versions do not have a built-in TO_BASE64() function):
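Something along these lines (web_page is a placeholder table name; the url size is borrowed from the example above, and compression assumes innodb_file_per_table is enabled):

CREATE TABLE web_page
(
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
url VARCHAR(3071) NOT NULL,
-- base64 of a 16-byte MD5 hash is always exactly 24 characters
url_hash CHAR(24) CHARACTER SET ascii COLLATE ascii_bin,
PRIMARY KEY (id),
UNIQUE KEY (url_hash),
KEY (url(16))
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;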
Note that I am storing the base64 version of the hash. This is a 4:3 size tradeoff compared to storing it in binary form, because it makes the table contents and the error message human-readable, and the inefficiency is partially offset by table compression. The hash column has a unique constraint. The data type is CHAR, not VARCHAR, since this eliminates the byte needed to store the size -- the hash is always a fixed size. The column uses the ascii character set with the ascii_bin (case-sensitive) collation, keeping the column and the unique index as small as practical.

The url_hash is set by a trigger, below, but the trigger does not check for a collision -- there is no need to check, because of the unique constraint on url_hash. The database will block a duplicate insert.
Note that url_hash should have been declared NOT NULL, but MySQL incorrectly enforces this before the BEFORE INSERT trigger fires, instead of after, so we are limited by that. The trigger does prevent it from being null.

The url column has a prefix index of length 16, which was chosen arbitrarily. This isn't a unique constraint, just an index for lookups, and it is probably shorter than you might want it to be, but its length has no operational impact on the problem we are solving here.
Here's the trigger to set the url_hash. We don't need to include this value in an INSERT statement when we insert rows.
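A minimal sketch, again assuming the hypothetical web_page table above (MD5 and the base64 encoding match the discussion below):

CREATE TRIGGER web_page_bi
BEFORE INSERT ON web_page
FOR EACH ROW
SET NEW.url_hash = TO_BASE64(UNHEX(MD5(NEW.url)));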
You need a trigger on update also, either to block updates if the table is supposed to be immutable, or to update the hash if the URL changes. We also need this trigger to ensure that the url_hash column can't be inappropriately set to NULL, since the limitation in MySQL doesn't allow us to actually declare it that way, as we should.

Now, to test.
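For example, with made-up URLs (server responses shown as comments, abridged):

INSERT INTO web_page (url) VALUES ('http://example.com/page-one');
-- Query OK, 1 row affected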
So far, so good. Now, a different URL:
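INSERT INTO web_page (url) VALUES ('http://example.com/page-two');
-- Query OK, 1 row affected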
Still works. Now, a duplicate.
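INSERT INTO web_page (url) VALUES ('http://example.com/page-one');
-- ERROR 1062 (23000): Duplicate entry '...' for key 'url_hash'
-- (the real message shows the base64 hash; elided here)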
Perfect. If you want an even lower risk of hash collisions than MD5 provides, use a SHA variant, increasing the length of url_hash to CHAR_LENGTH(TO_BASE64(UNHEX( /* your hash function */ ))) to accommodate the values generated by the hash algorithm in use.
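For example, sizing the column for SHA2-256 (my arbitrary pick; any supported algorithm works the same way):

SELECT CHAR_LENGTH(TO_BASE64(UNHEX(SHA2('probe', 256))));
-- returns 44, so url_hash would become CHAR(44)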