MySQL – How to prevent duplicate VARCHAR without a key limit

MySQL, unique-constraint, varchar

I'd like to store URLs in a database column, and enforce a constraint that values must be unique. Unfortunately, MySQL has a limit on the length of index keys, which means that only the first X characters of the URL get checked for uniqueness. Thus, I've run into false positives where two different URLs triggered an integrity constraint violation because their first X characters happened to be identical.

Is there a way to enforce uniqueness on a VARCHAR column without any limit on its length?

Is it possible to, say, create a non-UNIQUE index over the first X characters and then have a trigger block INSERTs if the remaining characters are identical?

Best Answer

We keep giving you answers that do not directly answer the question, because that is how we solve this problem. An index of unlimited length is impractical and inefficient, but a unique hash provides a solution that is sufficient to the task because of the astronomically low likelihood of a meaningful collision.

Similar to the other offered solutions, my standard approach does not check for duplicates up front -- it is optimistic in that sense: it relies on constraint checking by the database, with the assumption that most inserts are not duplicates, so there's no point in wasting time trying to determine if they are.

Working, tested example (5.7.16, backwards compatible to 5.6; previous versions do not have a built-in TO_BASE64() function):

CREATE TABLE web_page (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  url LONGTEXT NOT NULL,
  url_hash CHAR(24) COLLATE ascii_bin,
  PRIMARY KEY(id),
  UNIQUE KEY(url_hash),
  KEY(url(16))
)ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED;

Note that I am storing the base64 version of the hash. This is roughly a 4:3 size tradeoff compared to storing it in binary form (with padding, MD5's 16 bytes become 24 characters), but it makes the table contents and the error message human readable, and the inefficiency is partially offset by the table compression. The hash column has a unique constraint. The data type is CHAR, not VARCHAR, since this eliminates the byte needed to store the size -- the hash is always a fixed size. The column uses the ascii character set with ascii_bin (case-sensitive) collation, keeping the column and the unique index as small as practical.
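To make the size tradeoff concrete, here is a quick check (not part of the tested example, just a sketch) that compares the raw digest length with the length of its base64 form:

```sql
-- Raw MD5 digest is 16 bytes; base64 of 16 bytes is 24 characters,
-- including the trailing "==" padding.
SELECT LENGTH(UNHEX(MD5('http://example.com/')))                 AS raw_bytes,
       CHAR_LENGTH(TO_BASE64(UNHEX(MD5('http://example.com/')))) AS base64_chars;
```

If readability of the stored value and of the error message does not matter to you, a BINARY(16) column holding UNHEX(MD5(url)) would presumably be the more compact choice.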

The url_hash is set by a trigger, below, but the trigger does not check for a collision -- there is no need to check, because of the unique constraint on url_hash. The database will block a duplicate insert.

Note that url_hash should have been declared NOT NULL but MySQL incorrectly enforces this before the BEFORE INSERT trigger fires, instead of after, so we are limited by that. The trigger does prevent it from being null.

The url column has a prefix index length of 16, which was chosen arbitrarily. This isn't a unique constraint, just an index for lookups, and it is probably shorter than you might want it to be, but its length has no operational impact on the problem we are solving here.
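Since url_hash carries a unique index, an exact-match lookup can also go through the hash instead of the prefix index. A minimal sketch, assuming the web_page table above; the @url variable and the extra comparison on url are my additions for illustration, and it assumes the connection character set encodes the URL the same way the column does:

```sql
SET @url = 'http://example.com/';

SELECT id, url
  FROM web_page
 WHERE url_hash = TO_BASE64(UNHEX(MD5(@url)))  -- can be resolved via the unique index
   AND url = @url;                             -- precaution: never return a colliding row
```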

Here's the trigger to set the url_hash. We don't need to include this value in an INSERT statement when we insert rows.

DELIMITER $$
DROP TRIGGER IF EXISTS web_page_bi $$
CREATE TRIGGER web_page_bi BEFORE INSERT ON web_page FOR EACH ROW
BEGIN
  SET NEW.url_hash = TO_BASE64(UNHEX(MD5(NEW.url)));
END $$
DELIMITER ;

You need a trigger on update also, either to block updates if the table is supposed to be immutable, or to update the hash if the URL changes. We also need this trigger to ensure that the url_hash column can't be inappropriately set to NULL, since the MySQL limitation described above prevents us from declaring it NOT NULL, as we should.
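A sketch of what that update trigger might look like, for the case where the URL is allowed to change. The name web_page_bu is just my guess at following the same naming pattern as web_page_bi; this was not part of the tested example:

```sql
DELIMITER $$
DROP TRIGGER IF EXISTS web_page_bu $$
CREATE TRIGGER web_page_bu BEFORE UPDATE ON web_page FOR EACH ROW
BEGIN
  -- Always recompute the hash from the (possibly changed) URL. This also
  -- overrides anything the UPDATE tried to assign to url_hash, including NULL.
  SET NEW.url_hash = TO_BASE64(UNHEX(MD5(NEW.url)));

  -- Alternative for an immutable table: reject the update instead, e.g.
  --   SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'web_page rows are immutable';
END $$
DELIMITER ;
```

With this in place, an UPDATE that changes url to a value already present in another row should fail with the same duplicate-entry error as an INSERT would.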

Now, to test.

mysql> INSERT INTO web_page (url) VALUES ('http://example.com/');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM web_page;
+----+---------------------+--------------------------+
| id | url                 | url_hash                 |
+----+---------------------+--------------------------+
|  1 | http://example.com/ | pr8XV//wV/JmtpffnPF2/Q== |
+----+---------------------+--------------------------+
1 row in set (0.00 sec)

So far, so good. Now, a different URL:

mysql> INSERT INTO web_page (url) VALUES ('http://example.net/');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM web_page;
+----+---------------------+--------------------------+
| id | url                 | url_hash                 |
+----+---------------------+--------------------------+
|  1 | http://example.com/ | pr8XV//wV/JmtpffnPF2/Q== |
|  2 | http://example.net/ | ZVk/eLfvBI6tHN0Luj3NnQ== |
+----+---------------------+--------------------------+
2 rows in set (0.00 sec)

Still works. Now, a duplicate.

mysql> INSERT INTO web_page (url) VALUES ('http://example.com/');
ERROR 1062 (23000): Duplicate entry 'pr8XV//wV/JmtpffnPF2/Q==' for key 'url_hash'

Perfect. If you want an even lower risk of hash collisions than MD5 provides, use a SHA variant, increasing the length of url_hash to CHAR_LENGTH(TO_BASE64(UNHEX( /* your hash function */ ))) to accommodate the values generated by the hash algorithm in use.
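For instance, switching to SHA-256 might look like the following. This is a sketch under the assumption that the table is still empty (existing MD5 hashes would have to be recomputed otherwise), and I have not run it against the example above:

```sql
-- A SHA-256 digest is 32 bytes, which base64-encodes to 44 characters.
SELECT CHAR_LENGTH(TO_BASE64(UNHEX(SHA2('http://example.com/', 256)))) AS hash_len;

-- Widen the column to fit.
ALTER TABLE web_page
  MODIFY url_hash CHAR(44) CHARACTER SET ascii COLLATE ascii_bin;

-- Then change the assignment in the trigger(s) to:
--   SET NEW.url_hash = TO_BASE64(UNHEX(SHA2(NEW.url, 256)));
```

One thing to watch with longer digests: TO_BASE64() adds a newline after every 76 characters of encoded output, so for something like SHA-512 the measured length includes that newline. Measuring with CHAR_LENGTH() as shown accounts for it.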