You do not want a gigantic GEN_CLUST_INDEX (InnoDB's internal clustered index). That size is ungodly huge even for a secondary index.
You may have to resort to triggers or stored procedures to check for the key well in advance.
You could also think about running the SHA1 function over the VARCHAR(3071) field. SHA1 returns a 40-character hex string, and that hash may be just what you need to index.
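For instance, the hash is always 40 hex characters no matter how long the input is (the value below is the standard SHA1 of 'hello'):

SELECT SHA1('hello');
-- 'aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d'  (always 40 characters)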
Suppose you have this
CREATE TABLE mytable
(
id int not null auto_increment,
txt VARCHAR(3071),
primary key (id)
) ENGINE=InnODB;
and you want to make a UNIQUE index on txt. Try the SHA1 approach:
CREATE TABLE mytablenew LIKE mytable;
ALTER TABLE mytablenew ADD txtsha1 CHAR(40);
ALTER TABLE mytablenew ADD UNIQUE KEY (txtsha1);
INSERT IGNORE INTO mytablenew (id,txt,txtsha1)
SELECT id,txt,SHA1(txt) FROM mytable;
Then, count them (with INSERT IGNORE, any row whose SHA1 already exists is silently skipped, so duplicates show up as a difference between the two counts):
SELECT COUNT(1) FROM mytable;
SELECT COUNT(1) FROM mytablenew;
If the counts are the same, CONGRATULATIONS !!! The data had no duplicates, and you now have a unique index of length 40. You can finish up with:
ALTER TABLE mytable RENAME mytableold;
ALTER TABLE mytablenew RENAME mytable;
DROP TABLE mytableold;
This can be done more atomically, as pointed out in the comments:
RENAME TABLE mytable TO mytableold, mytablenew TO mytable;
DROP TABLE mytableold;
Perform this on whatever table you intend to have this big column in. Just remember to store the SHA1 of the data along with the data itself on every INSERT (for example, with a trigger, as sketched below).
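One way to make sure it never gets forgotten (using the trigger idea mentioned at the top) is a BEFORE INSERT trigger. This is just a sketch, and the trigger name is arbitrary:

CREATE TRIGGER mytable_sha1_bi BEFORE INSERT ON mytable
FOR EACH ROW SET NEW.txtsha1 = SHA1(NEW.txt);

If rows can be updated, add a matching BEFORE UPDATE trigger so the hash stays in sync with txt.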
The odds of a duplicate key from a hash collision are 1 in 2 to the 160th power (that's about 1 in 1.46 × 10^48).
Give it a Try !!!
From your question, I'm not completely sure if you're inserting multiple values per query, but you definitely should be (see the example after this paragraph). With MyISAM as the engine, there should not be a performance difference between the index being UNIQUE or PRIMARY; MyISAM doesn't treat them differently in this case. If you were using InnoDB, however, there would be a difference, since InnoDB stores the data in primary key order. If you don't need the id column, removing it and making domain the primary key would help performance.
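For example, batching several domains into a single statement (the table name here is an assumption based on your description):

INSERT INTO domains (domain) VALUES
('example.com'),
('example.net'),
('example.org');

One multi-row INSERT avoids the per-statement parsing and network overhead of many single-row INSERTs.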
Changing the collation should help, since ascii is much simpler than utf8, but you might want to use ascii_general_ci instead of ascii_bin, since domain names are case-insensitive.
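Something like this, assuming the column is a VARCHAR (the length here is just a guess):

ALTER TABLE domains
    MODIFY domain VARCHAR(255)
    CHARACTER SET ascii COLLATE ascii_general_ci;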
One other way to do the queries would be to get the row count, INSERT DELAYED, flush the delayed writes, and then get the new row count; the difference between the counts equals the number of affected rows. I doubt this would be significantly faster, though, and it would make the process more complex.
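A sketch of that process, with the caveat that INSERT DELAYED only works with MyISAM and has been deprecated (MySQL 5.7 treats it as a plain INSERT):

SELECT COUNT(*) FROM domains;    -- count before
INSERT DELAYED INTO domains (domain) VALUES ('example.com'), ('example.net');
FLUSH TABLES domains;            -- closing the table forces the delayed rows to be written
SELECT COUNT(*) FROM domains;    -- count after; the difference is the rows actually inserted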
Best Answer
Checking directly for uniqueness of a 5000-character string is going to be pretty expensive, no matter how you implement it.
I'd suggest computing a hash, storing that hash, and computing uniqueness on that hash. This is almost the same thing, but does introduce the opportunity for a hash collision, where two different values result in the same hash.
The likelihood of a collision will vary with the hashing algorithm: MD5 has a 1 in 2^128 chance of a collision. SHA1 and SHA256 each provide better protection against collisions, but are slower/more expensive.
For example: you could create a computed/virtual column with the MD5() function, then create a unique index on that column (sketched below). This won't guarantee perfect uniqueness, but it is likely unique enough for most purposes where you have a 5k chunk of text that you want to keep unique.

With hashing, if two different, unique strings happen to collide, the index would prevent the second unique value from being inserted, while two identical strings will always produce the same hash. So the problem with a weak hashing algorithm is that it would be too aggressive in preventing dupes when it shouldn't.
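A sketch of that, assuming MySQL 5.7+ generated columns and reusing the mytable/txt names from the first answer (MD5() returns 32 hex characters):

ALTER TABLE mytable
    ADD txtmd5 CHAR(32) AS (MD5(txt)) STORED,
    ADD UNIQUE KEY (txtmd5);

STORED materializes the hash in the row; InnoDB can also index a VIRTUAL column, but STORED is the simpler choice here.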