I have a table of unique values (domains_unique) with a column domain varchar(255) and more than 20 mil records.
What's the fastest way to insert into the table while keeping the domain unique constraint?
I've decided that the query should be:
INSERT IGNORE INTO domains_unique (domain) VALUES ('domain.com');
Should I make domain the primary key, or should I make it a unique index?
Primary key method:
CREATE TABLE `domains_unique` (
  `domain` varchar(255) NOT NULL,
  PRIMARY KEY (`domain`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
Unique index method:
CREATE TABLE `domains_unique` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `domain` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY unique_index (`domain`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
Also, how much would changing the CHARSET to ascii and the COLLATION to ascii_bin affect performance?
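For reference, the ascii variant of the primary-key table might look like this (a sketch of the change being asked about, not a tested definition):

```sql
-- Sketch: same table, but with the single-byte ascii character set.
-- ascii_bin compares raw bytes, so 'Example.com' and 'example.com'
-- would be treated as different domains.
CREATE TABLE `domains_unique` (
  `domain` varchar(255) NOT NULL,
  PRIMARY KEY (`domain`)
) ENGINE=MyISAM DEFAULT CHARSET=ascii COLLATE=ascii_bin;
```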
obs:
I would use INSERT DELAYED IGNORE INTO, but I need to know whether the specific row (domain) was inserted; that requirement also rules out bulk inserts.
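With a plain single-row INSERT IGNORE, one way to check whether the row went in is MySQL's ROW_COUNT() function, which reports the rows affected by the previous statement:

```sql
INSERT IGNORE INTO domains_unique (domain) VALUES ('domain.com');
-- ROW_COUNT() is 1 if the domain was new and was inserted,
-- 0 if it was a duplicate and the insert was ignored.
SELECT ROW_COUNT();
```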
UPDATE:
As promised, the benchmark results:
Inserting 5k unique rows (4.5k of them new) into a 1 mil row table, one row at a time:
- Primary Key method: 5.7 seconds
- Unique Index method: 6.3 seconds
To test scaling, I also ran the Unique Index method against a 40 mil row table; it took 45.8 s.
Best Answer
From your question, I'm not completely sure whether you're inserting multiple values per query, but you definitely should be. With MyISAM as the engine, there should not be a performance difference between the index being UNIQUE or PRIMARY; MyISAM doesn't treat them differently in this case. If you were using InnoDB, however, there would be a difference, since it stores the data in primary key order. If you don't need the id column, removing it and making domain the primary key would help performance.

Changing the collation should help, since ascii is much simpler than utf8, but you might want to use ascii_general_ci instead of ascii_bin, since domain names are case-insensitive.

One other way to do the queries would be to get the number of rows, INSERT DELAYED, flush the delayed writes, and then get the new row count. The difference in the counts would be the same as the affected rows. However, I don't think this would be significantly faster, and it would make the process more complex.
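That count-difference approach, combined with a multi-value insert, might look like the following sketch (with IGNORE, the count difference tells you how many rows of the batch were new, though not which ones; the domain values are placeholders):

```sql
SELECT COUNT(*) FROM domains_unique;            -- row count before

INSERT DELAYED IGNORE INTO domains_unique (domain)
VALUES ('a.com'), ('b.com'), ('c.com');         -- batch insert

FLUSH TABLES domains_unique;                    -- force pending delayed
                                                -- writes to complete

SELECT COUNT(*) FROM domains_unique;            -- row count after; the
-- difference is the number of rows actually inserted
```

Note that INSERT DELAYED is MyISAM-specific and was deprecated in later MySQL versions, so this only applies to the setup described in the question.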