MariaDB – Does ALTER TABLE ... ADD COLUMN lock the table?

alter-table amazon-rds mariadb

I have a question about the ALTER TABLE ... ADD COLUMN DDL statement.

On an Amazon RDS instance with MariaDB v10.2, I've noticed that INSERT statements complete and the rows are correctly inserted in the table (as verified via SELECT) before an ALTER TABLE ... ADD COLUMN on the table finishes.

Shouldn't any DML statement that performs a write be queued until the ALTER TABLE operation finishes?

I'm posting this question because I've been asked to perform some tests to verify whether it is possible to run ALTER TABLE ... ADD COLUMN on a live Production database during business hours, on a heavily used database, on tables with several million rows, which I find very ill-advised. Even if ALTER TABLE does not place a lock on the table, it will have to wait until no connection is using the table anymore (because each such connection holds a metadata lock), which may happen much later.

EDIT: Apparently this evaluation was too pessimistic. I've been doing several tests with mysqlslap performing heavy operations on the table (INSERT, UPDATE, DELETE statements, and SELECT statements with LIKE to avoid using indexes) on 150 simulated concurrent connections while ALTER TABLE ... ADD COLUMN runs; profiling shows metadata locks but with short waiting times (1 sec each), and table alteration completes in around 30 minutes, compared to 10 minutes with no SQL statements running. While this is satisfying, on the other hand I'd like to know whether it is safe to assume that DDL statements are non-blocking.

(It is probably worth noting that there is an Instant ADD COLUMN feature on InnoDB, which allows instant addition of a column to the table (under specific constraints), but it is not available before v10.3.2.)

Best Answer

Yes, it locks the table. From the docs on MySQL 8,

The exception referred to earlier is that ALTER TABLE blocks reads (not just writes) at the point where it is ready to clear outdated table structures from the table and table definition caches. At this point, it must acquire an exclusive lock. To do so, it waits for current readers to finish, and blocks new reads and writes.

And from the docs you linked, it's pretty explicit

With instant ADD COLUMN, you can enjoy all the benefits of structured storage without the drawback of having to rebuild the table.

As you've stated, you're on 10.2. So it looks like adding a column will require rebuilding the whole table.

As to what happens when you can't acquire a lock,

I've noticed that INSERT statements complete and the rows are correctly inserted in the table (as verified via SELECT) before an ALTER TABLE ... ADD COLUMN on the table finishes.

Yes, that's generally what happens during a lock, but be aware that's not always what happens. Sometimes statements and transactions give up waiting. Sometimes backends and pools get reaped when they're stuck waiting. It's always safer to do this during downtime, to set timeouts, and to catch errors from libraries when the timeouts expire. So long as you're using transactions, things roll back if something is triggered and can't get its lock before the timeout, and all will be kosher.
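The advice about timeouts and error handling can be sketched roughly as follows. This is only an illustration, not a tested recipe: `run_ddl_with_timeout` and its `execute` parameter are hypothetical names, and `execute` stands in for whatever call your driver uses to send one SQL string and raise on failure. The only server feature assumed is the `lock_wait_timeout` session variable, which bounds how long a session waits for a metadata lock.

```python
import time


def run_ddl_with_timeout(execute, ddl, lock_wait_seconds=5, attempts=3, pause_seconds=1.0):
    """Run a DDL statement with a bounded metadata-lock wait, retrying on failure.

    `execute` is a hypothetical callable that sends one SQL string to the
    server and raises an exception on error (e.g. a wrapper around a
    driver's cursor.execute). Returns the attempt number that succeeded.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Bound the wait for the metadata lock instead of queueing indefinitely.
            execute(f"SET SESSION lock_wait_timeout = {int(lock_wait_seconds)}")
            execute(ddl)
            return attempt
        except Exception as exc:  # a real driver raises a more specific error class
            last_error = exc
            if attempt < attempts:
                time.sleep(pause_seconds)  # back off before the next try
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

A short timeout makes the ALTER TABLE fail fast rather than sit in the queue and block the statements that arrive behind it; the caller then decides whether to retry or to wait for a quieter window.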

Related Solutions

Postgresql – Adding nullable column to table costs more than 10 minutes

There are a couple of misunderstandings here:

The null bitmap is not part of the heap tuple header. Per documentation:

There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap ...

Your 32 nullable columns are unsuspicious for two reasons:

The null bitmap is added per row, and only if there is at least one actual NULL value in the row. Nullable columns have no direct impact, only actual NULL values do. If the null bitmap is allocated, it's always allocated completely (all or nothing). The actual size of the null bitmap is 1 bit per column, rounded up to the next byte. Per current source code:
```
#define BITMAPLEN(NATTS) (((int)(NATTS) + 7) / 8)
```
The null bitmap is allocated after the heap tuple header and followed by an optional OID and then row data. The start of an OID or row data is indicated by t_hoff in the header. Per a comment in the source code:

Note that t_hoff must be a multiple of MAXALIGN.
There is one free byte after the heap tuple header, which occupies 23 bytes. So the null bitmap for rows up to 8 columns effectively comes at no additional cost. With the 9th column in the table, t_hoff is advanced another MAXALIGN (typically 8) bytes to provide for another 64 columns. So the next border would be at 72 columns.
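The alignment arithmetic above can be checked with a small sketch. It assumes the 23-byte header and a MAXALIGN of 8 as stated in the text, and it ignores the optional OID. The function names are made up for illustration and are not PostgreSQL APIs.

```python
HEADER_BYTES = 23  # fixed heap tuple header size cited above
MAXALIGN = 8       # typical value; platform-dependent assumption


def bitmap_len(natts: int) -> int:
    """Null bitmap size in bytes: 1 bit per column, rounded up (mirrors BITMAPLEN)."""
    return (natts + 7) // 8


def t_hoff(natts: int, has_nulls: bool = True) -> int:
    """Offset of row data: header plus optional null bitmap, padded to a multiple of MAXALIGN."""
    size = HEADER_BYTES + (bitmap_len(natts) if has_nulls else 0)
    return (size + MAXALIGN - 1) // MAXALIGN * MAXALIGN


for natts in (8, 9, 72, 73):
    print(natts, t_hoff(natts))
```

Under these assumptions the offset is 24 bytes for up to 8 columns, 32 bytes for 9 to 72 columns, and 40 bytes from the 73rd column on, which matches the borders described above.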

To display control information of a PostgreSQL database cluster (incl. MAXALIGN), example for a typical installation of Postgres 9.3 on a Debian machine:

    sudo /usr/lib/postgresql/9.3/bin/pg_controldata /var/lib/postgresql/9.3/main

I updated instructions in the related answer you quoted.

All that aside, even if your ALTER TABLE statement triggers a whole table rewrite (which it probably does, changing a data type), 250K rows are really not that much and would be a matter of seconds on any halfway decent machine (unless rows are unusually big). 10 minutes or more indicates a completely different problem. Your statement is waiting to get a lock on the table, most likely.

The growing number of entries in pg_stat_activity means more open transactions, which indicates concurrent access to the table (most likely) that has to wait for the operation to finish.

A few shots in the dark

Check for possible table bloat, try a gentle VACUUM mytable or a more aggressive VACUUM FULL mytable - which might encounter the same concurrency issues, since this form also acquires an exclusive lock. You could try pg_repack instead ...

I would start by inspecting possible issues with indexes, triggers, foreign key or other constraints, especially those involving the column. A corrupted index in particular might be involved. Try REINDEX TABLE mytable; or DROP all of them and re-add them after ALTER TABLE in the same transaction.

Try running the command at night or whenever there is not much load.

A brute-force method would be to stop access to the server, then try again:

Force drop db while others may be connected in postgresql

Without being able to pin it down, upgrading to the current version or the upcoming 9.4 in particular might help. There have been several improvements for big tables and for locking details. But if there is something broken in your DB, you should probably figure that out first.

Mysql – ALTER TABLE … ROW_FORMAT=Compressed times out

Please note that I am not a MySQL developer, I use MS SQL Server. But the behavior in your post suggests the following:

It does not look like a deadlock to me, especially with the error messages:

InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.

With a deadlock, the server would normally determine quickly which connection to roll back, instead of waiting up to 600 seconds before crashing.

Likely what is happening is that the table metadata change (the ALTER TABLE) cannot happen while there are other transactions using the table.

http://dev.mysql.com/doc/refman/5.6/en/metadata-locking.html

The link says, in part: "To ensure transaction serializability, the server must not permit one session to perform a data definition language (DDL) statement on a table that is used in an uncompleted explicitly or implicitly started transaction in another session."

EDIT: Sorry for the misunderstanding regarding deadlocks.

Per this post, https://stackoverflow.com/questions/24860111/warning-a-long-semaphore-wait

Ensure that: innodb_adaptive_hash_index=0
