Mysql – Idle connection plus schema-modifying query causing locked database

lockingMySQLpythonsleep

As part of our automated deployment process for a web app running on a LAMP stack, we drop all our triggers and stored procedures and recreate them from source control. It turns out there was a hidden danger to this approach that we hadn't thought about.

A few days ago we managed to end up with the database for (the staging version of) our web app stuck in a horribly hung state after the following sequence of events:

  1. I connect to the remote database from our office (via Python's MySQLdb, as it happens) and run a few SELECT queries on the Foo table.
  2. I leave the connection open, because I'm lazy.
  3. My boss commits some changes on his laptop, pushes to the remote repo on the web server, and goes to lunch without looking at the output
  4. The deployment hook on the web server tries to update the triggers and stored procedures in the database, but isn't able to even DROP the first trigger because the trigger involves the Foo table, which my currently sleeping connection had previously done some SELECTs from.
  5. Now nobody can SELECT from the Foo table at all, because the connection trying to DROP the trigger has already taken out a lock on the Foo table that prevents any other connections from accessing the Foo table in any way – even though it's still waiting for the sleeping connection to be closed before it can actually do anything.
  6. Crucial business processes relying upon the Foo table grind to a halt, alarms sound, and our web app stops serving customers. My boss flies into a rage and declares that heads will roll if the cause of the problem is not found and fixed so that this can never happen again. (Just kidding, it was only our staging server and my boss is very friendly.)

What's interesting is that this scenario wasn't caused by any kind of deadlock; it was caused by a sleeping connection implicitly holding some kind of lock that prevented the DROP TRIGGER statement from executing, just by virtue of having done a SELECT on the same table previously. None of the anti-deadlock features of MySQL could automatically kill a process and save the situation, because ultimately everything could continue as soon as my original process – the idle one that had only ever done SELECTs – was killed. The fact that MySQL locks behave this way by default seems perverse to me, but that's not the point. I'm trying to figure out a way to ensure that the disaster scenario described above can't ever recur (especially on our live server). How would you suggest I do this?

We've talked the problem over in the office, and there are a couple of hypothetical solutions we saw:

  • Change some config setting somewhere so that sleeping processes time out after 10 seconds by default, so that a sleeping process can never sit on locks. Better yet, have them just release all locks after 10 seconds so that I can still go to lunch and leave my MySQL shell open, or my Python window open with a MySQLdb connection active, then come back and use it, without fear of breaking anything.

    • This might be really irritating when trying to run queries manually, especially ones that require grouping into a transaction.
  • Work some magic on the queries that try to replace the triggers and stored procedures so that the acquisition of locks required for the relevant DROPs and CREATEs is made into an atomic operation – something like, if the query can't acquire all the locks it needs immediately in sequence, then it releases them and tries again periodically until it works.

    • This might just make our deployment process never complete, though, if the database is too busy for it to be able to grab all the locks in one go.
  • Drastically reduce the frequency of schema-modifying queries we make (it only seems to be these that can be blocked from starting by a connection that's only done SELECTs), for instance by having our deployment script check whether a stored procedure or trigger in source control has changed from the version in the database before DROPping and reCREATEing the one on the database.

    • This only mitigates the problem, it doesn't actually eliminate it.

We're not sure if either of the first two solutions we considered are even possible in MySQL, though, or if we're missing a better solution (we're developers, not DBAs, and this is outside of our comfort zone). What would you recommend?

Best Answer

This is probably because auto commit is off by default, as specified by PEP 249. This seems to cause any SELECT to lock the metadata table. You can probably turn auto commit on (as long as that's safe based on your application code), which will close the implicit transaction associated with the SELECT immediately. Alternatively, use explicit transactions.