You do not want a gigantic gen_clust_index (InnoDB's internal clustered index). That size is ungodly huge even for a secondary index.
You may have to resort to triggers or stored procedures to check for the key well in advance.
You could also think about applying the SHA1() function to the VARCHAR(3071) field. SHA1() returns a 40-character hex string, and this hash may be just what you need to index.
Suppose you have this:
CREATE TABLE mytable
(
id int not null auto_increment,
txt VARCHAR(3071),
primary key (id)
) ENGINE=InnoDB;
and you want to make a UNIQUE index on txt. Try the SHA1 approach:
CREATE TABLE mytablenew LIKE mytable;
ALTER TABLE mytablenew ADD txtsha1 CHAR(40);
ALTER TABLE mytablenew ADD UNIQUE KEY (txtsha1);
INSERT INTO mytablenew (id,txt,txtsha1)
SELECT id,txt,SHA1(txt) FROM mytable;
Then, count them:
SELECT COUNT(1) FROM mytable;
SELECT COUNT(1) FROM mytablenew;
If the counts are the same, CONGRATULATIONS!!! Now you have a unique index of length 40. You can finish up with:
ALTER TABLE mytable RENAME mytableold;
ALTER TABLE mytablenew RENAME mytable;
DROP TABLE mytableold;
This could be done more atomically, as pointed out in the comments below:
RENAME TABLE mytable TO mytableold, mytablenew TO mytable;
DROP TABLE mytableold;
Perform this on whatever table you intend to have this big column in. You have to remember to add the SHA1 of the data along with the data upon INSERT.
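For example, using the rebuilt table above (the value is made up):

SET @txt = 'http://example.com/some/long/value';
INSERT INTO mytable (txt, txtsha1) VALUES (@txt, SHA1(@txt));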
The odds of a duplicate hash are 1 in 2^160, which is 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 (about 1.4615 × 10^48).
Give it a try!!!
Several ideas come to mind:
The hash table could be a MEMORY table; once your inserts are done, you could then copy just the domains that don't already exist into the permanent domain_list table. Having your new data in memory might also make the domain-name comparison between the hash table and the domain_list table go faster.
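A sketch of that flow (domain_staging is a made-up name, and I'm assuming domain_list has a domain column):

CREATE TABLE domain_staging
(
domain VARCHAR(255) NOT NULL,
PRIMARY KEY (domain)
) ENGINE=MEMORY;

-- bulk-load the new batch into domain_staging, then copy only unseen domains:
INSERT INTO domain_list (domain)
SELECT s.domain
FROM domain_staging s
LEFT JOIN domain_list d ON d.domain = s.domain
WHERE d.domain IS NULL;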
Another option would be to use the innodb_file_per_table option and keep the per-table tablespace (.ibd) files on a faster disk.
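For example, in my.cnf (the .ibd files, or symlinks to them, can then live on the faster disk):

[mysqld]
innodb_file_per_table = 1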
A third option would be to load your entire dataset into a MEMORY table and then convert that table to InnoDB using ALTER TABLE table_name ENGINE=InnoDB. This will be much faster than inserting into InnoDB directly. This is possible since you control how and when data is inserted.
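A sketch using the earlier sample table (mytable_mem is a made-up name; note that MEMORY tables can't hold TEXT or BLOB columns, though VARCHAR is fine):

CREATE TABLE mytable_mem LIKE mytable;
ALTER TABLE mytable_mem ENGINE=MEMORY;
-- ... bulk-load into mytable_mem here ...
ALTER TABLE mytable_mem ENGINE=InnoDB;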
Best Answer
We keep giving you answers that do not directly answer the question, because that is how we solve this problem: an index of unlimited length is impractical and inefficient, but a unique hash is sufficient to the task because of the astronomically low likelihood of a meaningful collision.
Similar to the other offered solutions, my standard approach does not check for duplicates up front -- it is optimistic in that sense: it relies on constraint checking by the database, with the assumption that most inserts are not duplicates, so there's no point in wasting time trying to determine if they are.
Working, tested example (5.7.16, backwards compatible to 5.6; previous versions do not have a built-in TO_BASE64() function):
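Something along these lines (web_page is a placeholder table name; the url size is borrowed from the example above, and compression assumes innodb_file_per_table is enabled):

CREATE TABLE web_page
(
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
url VARCHAR(3071) NOT NULL,
-- base64 of a 16-byte MD5 hash is always exactly 24 characters
url_hash CHAR(24) CHARACTER SET ascii COLLATE ascii_bin,
PRIMARY KEY (id),
UNIQUE KEY (url_hash),
KEY (url(16))
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;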
Note that I am storing the base64 version of the hash. This is a 4:3 size tradeoff compared to storing it in binary form, because it makes the table contents and the error message human-readable, and the inefficiency is partially offset by table compression. The hash column has a unique constraint. The data type is CHAR, not VARCHAR, since this eliminates the byte needed to store the size -- the hash is always a fixed size. The column uses the ascii character set with the ascii_bin (case-sensitive) collation, keeping the column and the unique index as small as practical.

The url_hash is set by a trigger, below, but the trigger does not check for a collision -- there is no need to check, because of the unique constraint on url_hash. The database will block a duplicate insert.
Note that url_hash should have been declared NOT NULL, but MySQL incorrectly enforces this before the BEFORE INSERT trigger fires, instead of after, so we are limited by that. The trigger does prevent it from being null.

The url column has a prefix index of length 16, which was chosen arbitrarily. This isn't a unique constraint, just an index for lookups, and it is probably shorter than you might want it to be, but its length has no operational impact on the problem we are solving here.
Here's the trigger to set the url_hash. We don't need to include this value in an INSERT statement when we insert rows.
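A minimal sketch, again assuming the hypothetical web_page table above (MD5 and the base64 encoding match the discussion below):

CREATE TRIGGER web_page_bi
BEFORE INSERT ON web_page
FOR EACH ROW
SET NEW.url_hash = TO_BASE64(UNHEX(MD5(NEW.url)));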
You need a trigger on update also, either to block updates if the table is supposed to be immutable, or to update the hash if the URL changes. We also need this trigger to ensure that the url_hash column can't be inappropriately set to NULL, since the limitation in MySQL doesn't allow us to actually declare it that way, as we should.

Now, to test.
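For example, with made-up URLs (server responses shown as comments, abridged):

INSERT INTO web_page (url) VALUES ('http://example.com/page-one');
-- Query OK, 1 row affected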
So far, so good. Now, a different URL:
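INSERT INTO web_page (url) VALUES ('http://example.com/page-two');
-- Query OK, 1 row affected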
Still works. Now, a duplicate.
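INSERT INTO web_page (url) VALUES ('http://example.com/page-one');
-- ERROR 1062 (23000): Duplicate entry '...' for key 'url_hash'
-- (the real message shows the base64 hash; elided here)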
Perfect. If you want an even lower risk of hash collisions than MD5 provides, use a SHA variant, increasing the length of url_hash to CHAR_LENGTH(TO_BASE64(UNHEX( /* your hash function */ ))) to accommodate the values generated by the hash algorithm in use.
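For example, sizing the column for SHA2-256 (my arbitrary pick; any supported algorithm works the same way):

SELECT CHAR_LENGTH(TO_BASE64(UNHEX(SHA2('probe', 256))));
-- returns 44, so url_hash would become CHAR(44)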