Unnecessary 3am (emergency) call

Help, the database is on fire. Well, it probably is but the solution may also be easy. Here are a few steps for the part-time MySQL DBA/sysadmin/developer. Total time to address this solution was 2 minutes, the inability to not go back to sleep, not priceless.

First, access to the DB server is possible, and also the mysql command line client. If you are getting “Too Many Connections” read Why GRANT ALL is bad .

Given the message “the database is on fire”, the likely cause is rogue SQL.

mysql> show global variables like 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 5000  |
+-----------------+-------+
1 row in set (0.02 sec)

mysql> show global status like 'threads%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_cached    | 1     |
| Threads_connected | 5001  |
| Threads_created   | 5002  |
| Threads_running   | 5001  |
+-------------------+-------+
4 rows in set (0.01 sec)

Yeah, as expected. A SHOW PROCESSLIST in this situation shows all queries stuck in the state of “query end”, which indicates the obvious problem to me (a disk space problem).

For the record I do not recommend setting max_connections to 5,000. MySQL (and Linux) does not function with 5,000 concurrent connections (especially when they all want to create a disk based temporary table, but that’s another story). What is missing is a maximum threads running configuration option, and applicable coding practices and proactive server connection management to prevent such a situation.

Confirming my suspicions.

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde             7.9G  2.5G  5.0G  34% /
tmpfs                 7.3G     0  7.3G   0% /dev/shm
/dev/xvdf1            197G   42G  146G  23% /mysql
/dev/xvdh1             99G   94G  8.0K 100% /mysql/binlog

The solution here is simple, you need to remove old binary logs. It is NOT the recommendation you delete the files manually but use PURGE MASTER LOGS. However this produced an error.

mysql> purge master logs to 'mysql-bin.000029';
ERROR 3 (HY000): Error writing file '/mysql/binlog/mysql-bin.index_crash_safe' (Errcode: 28 - No space left on device)

That’s a new message, possibly a new MySQL 5.6 error situation. The only choice now is to remove a physical file first. A suggestion to MySQL engineers. Let’s create this file in normal operations with sufficient blank bytes, enabling MySQL to be able to create the file even when the disk is full, and then avoid manual file manipulation.

$ rm mysql-bin.000002

Then:

mysql> purge master logs to 'mysql-bin.000029';
Query OK, 0 rows affected, 1 warning (1.37 sec)
mysql> show warnings;
+---------+------+---------------------------------------------------------------+
| Level   | Code | Message                                                       |
+---------+------+---------------------------------------------------------------+
| Warning | 1612 | Being purged log /mysql/binlog/mysql-bin.000002 was not found |
+---------+------+---------------------------------------------------------------+
1 row in set (0.01 sec)

As always, a recommendation when problems occur is to check the error log (aka Monitoring the MySQL Error Log ). This issue is also immediately apparent.

2013-08-23 06:28:36 30609 [Warning] Disk is full writing '/mysql/binlog/mysql-bin.000097' (Errcode: 28 - No space left on device). Waiting for someone to free space...
2013-08-23 06:28:36 30609 [Warning] Retry in 60 secs. Message reprinted in 600 secs

I would also like to point out that being proactive and having monitoring and instrumentation in ALL startups is critical if you want to be successful. Point 1 in my recommendations of Successful Scalability Principles .

Was it really an emergency?

Tagged with: Databases MySQL

Related Posts

Why Being Proactive Is Always a Winning Approach

Many companies manage production infrastructure using a reactive model rather than a proactive one. Organizations typically react to warnings and alerts, then implement corrective actions in response. While some companies have well-designed architectural patterns—such as feature flags and rate limiting—that can quickly mitigate the impact of issues, these are merely temporary solutions, not resolutions.

Read more

AWS CLI support for Aurora DSQL and S3 Tables

If you were following the AWS Re:invent keynote yesterday there were several data specific announcements including Aurora DSQL and S3 Tables . Wanting to check them out, I downloaded the latest AWS CLI 2.

Read more

Migrating off of WordPress - A Simplified Stack

The ongoing drama between Wordpress v WP Engine continues to cross my reading list, but I have permanently removed WordPress from my website. I have finally transitioned away from the complex Linux/Apache/MySQL/PHP (LAMP) stack required for self-hosting WordPress on my professional website.

Read more