MySQL Performance – ON DUPLICATE KEY UPDATE vs. UPDATE

MySQLupdate

I have a table with about 17 million rows:

mysql> describe humans_we_respect;
+---------------------+-------------------------------------------------------------------------+------+-----+---------+-------+
| Field               | Type                                                                    | Null | Key | Default | Extra |
+---------------------+-------------------------------------------------------------------------+------+-----+---------+-------+
| id                  | bigint(20)                                                              | NO   | PRI | NULL    |       |
| name                | varchar(63)                                                             | YES  |     | NULL    |       |
| address             | varchar(127)                                                            | YES  |     | NULL    |       |
| city                | varchar(63)                                                             | YES  |     | NULL    |       |
| state               | varchar(3)                                                              | YES  | MUL | NULL    |       |
| zip                 | varchar(15)                                                             | YES  |     | NULL    |       |
| country             | varchar(15)                                                             | YES  |     | NULL    |       |
| email               | varchar(127)                                                            | YES  |     | NULL    |       |
| website             | varchar(127)                                                            | YES  |     | NULL    |       |
| area_code_state     | varchar(3)                                                              | YES  | MUL | NULL    |       |
| timezone            | set('other','pacific','mountain','central','eastern','alaska','hawaii') | YES  |     | other   |       |
+---------------------+-------------------------------------------------------------------------+------+-----+---------+-------+
12 rows in set (0.01 sec)

Due to the strict nature of only contacting those who expressed interest in a newsletter, and the strict nature of never contacting someone who asked not to be contacted, before a mailing I added a field expressed_interest (tinyint) deafult null which I switch to 1 for those who expressed interest, and then switch to null for those who asked not to be contacted.

The following query, in which 10000 rows are updated per query, takes a very long time to run (killed after half an hour):

UPDATE humans_we_respect SET expressed_interest=1 WHERE id IN (1,...,10000);

However the following query completes in seconds:

INSERT INTO humans_we_respect (id) VALUES (1),...,(10000) ON DUPLICATE KEY UPDATE expressed_interest=1;

Under what conditions will ON DUPLICATE KEY UPDATE be faster than UPDATE? I would like to know this for future use with large tables such as this.

This is on MySQL 5.5.33 running in Amazon RDS.

Best Answer

I know it is not easy getting an execution plan for an update from MySQL since it only provides those on SELECT statements. But the clue might be in the order in which the records are updated, the evaluation of a WHERE which contains a IN with a large amount of static data, as well as the amount of connected reads and writes, intermediate caching, associated with that.

The statement

UPDATE humans_we_respect SET expressed_interest=1 WHERE id IN (1,...,10000);

is a type of statement we try to avoid when updating larger databases since the parser seems to go wild on them from time to time. IN ( a,b,c,...,ZZZZ ) for me has become a coding style only suitable for very small item numbers in the IN data. I am working on a open source project where I frequently run into what I call "remote minded join", the second half usually looks exactly like your problem.

SELECT id FROM all_our_customers WHERE happytospam=1 AND LENGTH(email) > 6;
...
Storing result on client side as string like 
LOOP over results
$all_ids += ",$next_result";
END_LOOP
$all_ids = SUBSTRING($all_ids,1); 
ending up with a string like 
"1,2,3,4,5,8,10,100,1000,...,100000" in $all_ids
...
UPDATE humans_we_respect SET expressed_interest=1 WHERE id IN ( $all_ids )

While the first part usually executes lightning fast, the second part takes forever, which is what you describe as well. These queries can usually be sped up EXTREMELY by rewriting them as:

UPDATE humans_we_respect,all_our_customers 
SET humans_we_respect.expressed_interest=1 
WHERE all_our_customers.id = humans_we_respect.id 
AND all_our_customers.happytospam=1 
AND LENGTH(all_our_customers.email) > 6

We also used

UPDATE humans_we_respect 
SET expressed_interest=1 
WHERE id IN ( 
SELECT id 
FROM all_our_customers 
WHERE happytospam=1 
AND LENGTH(email) > 6 
)

which performed better than the original but not as good as my suggested version.

This all is assuming that you use proper indexes with primary on ID and combined multi column indexes where multiple columns are frequently used together or have a good significance together and are usually present in your queries.

The Clue is that large amounts of static values in IN clauses increase the execution time nearly exponentially on queries with many matching records, since they basically do NOT USE ANY INDEXES OR OPTIMIZATIONS and usually end up in full table scans, in which IMHO the execution will examine every record/line in your table by comparing it with every item in your IN() list ONE by ONE.

The statement like

INSERT INTO humans_we_respect (id) VALUES (1),...,(10000) ON DUPLICATE KEY UPDATE expressed_interest=1;

however is using the index to locate the record and then updates it, even so not intended for this use, it will run much better due to the use of a index on ID if there is one and will only do a index-lookup for the record rather than thousands of comparisons! Working directly with two tables might however be faster if the list of IDs is derived from another table on the same server, there are more optimizations that can be used plus you do not have to transfer data out-of and into the mysql server process.

Just as some extra info:

UPDATE humans_we_respect SET expressed_interest=1 WHERE id='1' OR id='2' OR ...

is a good technique to optimize IN() queries with very low counts of elements since it will create a parallel index query for every element, which is great for the first elements but will decrease in performance with more elements a lot, and at some point hit the limit of the parser for optimization (IMHO it might be 255 elements in one query) at which point it will grind to a snails pace again...

Related Solutions

MySQL ON DUPLICATE KEY UPDATE performance issue

Try to start a transaction or set autocommit=false prior the updates.

start transaction;
.... a lot of updates here ...
commit;

Also, the swapping may yield to the lack of physical RAM which occurs in some situations (big resources created by MySQL).

You may also try to
- increase the "key_buffer_size" to the maximum permitted by your SO and MySQL version (check manual). This will improve index updates.
- reduce "max_heap_table_size" (128MB or less) and increase "tmp_table_size" (the bigger, the better).
The idea is to avoid swappiness when some resources needs too much RAM, by sending the work to on-disk temporary tables by default. Some operations will suffer on speed (the bigger ones), but in whole the server will run smoother.

Of course, you may have a lot of memory so a perfect answer cannot be supplied without all informations, at least RAM quantity and server variables, because your problems may start there.

Mysql – on duplicate key update

^{Community Wiki answer originally based on notes by ypercubeᵀᴹ and Rick James suggested in the comments on the question.}

The immediate issue is the parentheses around the UPDATE set list: they should not be there according to the syntax. That is what the error message is about.

Once you have fixed that one, there are other issues with your INSERT statement.

The IGNORE keyword just does not make sense when used together with ON DUPLICATE KEY. You either perform an update when a duplicate key is encountered, or you ignore the duplicate key entry completely, and it looks as though you want the former – therefore, lose the IGNORE keyword.
Setting product_id to '' in the ON DUPLICATE KEY clause is not consistent with setting it to NULL in the VALUES clause: an empty string ('') is not the same as a null. Judging from the column name, you probably want to have NULL specified in both places.
Similarly, seeting id to '1' in the ON DUPLICATE KEY clause is not consistent with setting it to 1 in the VALUES clause: the string ('1') is not the same as the number (1). This will probably not be an issue as the string will be converted to a number but it doesn't hurt to be explicit and very careful with the types of your column and values you pass to them.
While this is not exactly a problem, you don't have to specify the same values twice (in VALUES and in ON DUPLICATE KEY UPDATE clauses). You can specify them once and use them in two places, using the VALUES() function, like this:
```
INSERT INTO shop_item_details 
  (shop_id, product_id, product_name, Product_model, 
   Product_category, Product_details, price) 
VALUES 
  (1, NULL, 'camera', 'sony', '', 'hd', '5000') 
ON DUPLICATE KEY UPDATE
  shop_id          = VALUES(shop_id), 
  product_id       = VALUES(product_id), 
  product_name     = VALUES(product_name), 
  Product_model    = VALUES(Product_model), 
  Product_category = VALUES(Product_category), 
  Product_details  = VALUES(Product_details), 
  price            = VALUES(price) ;
```
Please note that the above is just an example. Since you are updating the same set of columns as those you are inserting into, setting each of them using the column = VALUES(column) pattern is slightly redundant. The columns that define the unique (or primary) key being violated (you have not specified which columns they are in your case - perhaps shop_id and product_id? Or shop_id and product_name and product_model?) can be safely removed from the ON DUPLICATE KEY UPDATE clause.

Best Answer

Related Solutions

MySQL ON DUPLICATE KEY UPDATE performance issue

Mysql – on duplicate key update

Related Question