You do not want a gigantic GEN_CLUST_INDEX (InnoDB's internal clustered index). That size is ungodly huge even for a secondary index.
You may have to resort to triggers or stored procedures to check for the key well in advance.
You could also think about running the SHA1 function over the VARCHAR(3071) field. SHA1 returns a 40-character hex string, and that hash may be just what you need to index.
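For instance, the hash is always 40 hex characters no matter how long the input is (the value below is the standard SHA1 of 'hello'):

SELECT SHA1('hello');
-- 'aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d'  (always 40 characters)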
Suppose you have this
CREATE TABLE mytable
(
id int not null auto_increment,
txt VARCHAR(3071),
primary key (id)
) ENGINE=InnODB;
and you want to make a UNIQUE index on txt. Try the SHA1 approach:
CREATE TABLE mytablenew LIKE mytable;
ALTER TABLE mytablenew ADD txtsha1 CHAR(40);
ALTER TABLE mytablenew ADD UNIQUE KEY (txtsha1);
INSERT IGNORE INTO mytablenew (id,txt,txtsha1)
SELECT id,txt,SHA1(txt) FROM mytable;
Then, count them (with INSERT IGNORE, any row whose SHA1 already exists is silently skipped, so duplicates show up as a difference between the two counts):
SELECT COUNT(1) FROM mytable;
SELECT COUNT(1) FROM mytablenew;
If the counts are the same, CONGRATULATIONS !!! The data had no duplicates, and you now have a unique index of length 40. You can finish up with:
ALTER TABLE mytable RENAME mytableold;
ALTER TABLE mytablenew RENAME mytable;
DROP TABLE mytableold;
This can be done more atomically, as pointed out in the comments:
RENAME TABLE mytable TO mytableold, mytablenew TO mytable;
DROP TABLE mytableold;
Perform this on whatever table you intend to have this big column in. Just remember to store the SHA1 of the data along with the data itself on every INSERT (for example, with a trigger, as sketched below).
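One way to make sure it never gets forgotten (using the trigger idea mentioned at the top) is a BEFORE INSERT trigger. This is just a sketch, and the trigger name is arbitrary:

CREATE TRIGGER mytable_sha1_bi BEFORE INSERT ON mytable
FOR EACH ROW SET NEW.txtsha1 = SHA1(NEW.txt);

If rows can be updated, add a matching BEFORE UPDATE trigger so the hash stays in sync with txt.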
The odds of a duplicate key from a hash collision are 1 in 2 to the 160th power (that's about 1 in 1.46 × 10^48).
Give it a Try !!!
From your question, I'm not completely sure if you're inserting multiple values per query, but you definitely should be (see the example after this paragraph). With MyISAM as the engine, there should not be a performance difference between the index being UNIQUE or PRIMARY; MyISAM doesn't treat them differently in this case. If you were using InnoDB, however, there would be a difference, since InnoDB stores the data in primary key order. If you don't need the id column, removing it and making domain the primary key would help performance.
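For example, batching several domains into a single statement (the table name here is an assumption based on your description):

INSERT INTO domains (domain) VALUES
('example.com'),
('example.net'),
('example.org');

One multi-row INSERT avoids the per-statement parsing and network overhead of many single-row INSERTs.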
Changing the collation should help, since ascii is much simpler than utf8, but you might want to use ascii_general_ci instead of ascii_bin, since domain names are case-insensitive.
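Something like this, assuming the column is a VARCHAR (the length here is just a guess):

ALTER TABLE domains
    MODIFY domain VARCHAR(255)
    CHARACTER SET ascii COLLATE ascii_general_ci;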
One other way to do the queries would be to get the row count, INSERT DELAYED, flush the delayed writes, and then get the new row count; the difference between the counts equals the number of affected rows. I doubt this would be significantly faster, though, and it would make the process more complex.
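A sketch of that process, with the caveat that INSERT DELAYED only works with MyISAM and has been deprecated (MySQL 5.7 treats it as a plain INSERT):

SELECT COUNT(*) FROM domains;    -- count before
INSERT DELAYED INTO domains (domain) VALUES ('example.com'), ('example.net');
FLUSH TABLES domains;            -- closing the table forces the delayed rows to be written
SELECT COUNT(*) FROM domains;    -- count after; the difference is the rows actually inserted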
Best Answer
Checking directly for uniqueness of a 5000-character string is going to be pretty expensive, no matter how you implement it.
I'd suggest computing a hash, storing that hash, and computing uniqueness on that hash. This is almost the same thing, but does introduce the opportunity for a hash collision, where two different values result in the same hash.
The likelihood of a collision will vary with the hashing algorithm: MD5 has a 1 in 2^128 chance of a collision. SHA1 and SHA256 each provide better protection against collisions, but are slower/more expensive.
For example: you could create a computed/virtual column with the MD5() function, then create a unique index on that column (sketched below). This won't guarantee perfect uniqueness, but it is likely unique enough for most purposes where you have a 5k chunk of text that you want to keep unique.

With hashing, if two different, unique strings happen to collide, the index would prevent the second unique value from being inserted, while two identical strings will always produce the same hash. So the problem with a weak hashing algorithm is that it would be too aggressive in preventing dupes when it shouldn't.
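A sketch of that, assuming MySQL 5.7+ generated columns and reusing the mytable/txt names from the first answer (MD5() returns 32 hex characters):

ALTER TABLE mytable
    ADD txtmd5 CHAR(32) AS (MD5(txt)) STORED,
    ADD UNIQUE KEY (txtmd5);

STORED materializes the hash in the row; InnoDB can also index a VIRTUAL column, but STORED is the simpler choice here.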