Do you control your database outages?

Working with a client last week I noted in my analysis, “The mysql server was restarted on Thursday and so the [updated] my.cnf settings seems current”. This occurred between starting my analysis on Wednesday and delivering my findings on Friday.

# more /var/lib/mysql/ip-104-238-102-213.secureserver.net.err
160609 17:04:43 [Note] /usr/sbin/mysqld: Normal shutdown

The client however stated they did not restart MySQL and would not do that at 5pm which is still a high usage time of the production system. This is unfortunately not an uncommon finding, that a production system had an outage and that the client did not know about it and did not instigate this.

There are several common causes and the “DevOps” mindset for current production systems has made this worse.

  • You have managed hosting and they perform software updates with/without your knowledge. I have for example worked with several Rackspace customers and there would be an outage because Rackspace engineers decided to apply an upgrade at a time that suited them, not their customers.
  • You have chosen automate updates for your Operating System.
  • Your developers update the software when they like.
  • You are using a 3rd party product that is making an arbitrary decision.

In this case the breadcrumbs lead to the last option, that cPanel is performing this operation as hinted by the cPanel specific installed MySQL binaries.

$ rpm -qa | grep -i mysql
cpanel-perl-522-MySQL-Diff-0.43-1.cp1156.x86_64
MySQL55-devel-5.5.50-1.cp1156.x86_64
MySQL55-client-5.5.50-1.cp1156.x86_64
cpanel-perl-522-DBD-mysql-4.033-1.cp1156.x86_64
compat-MySQL50-shared-5.0.96-4.cp1136.x86_64
MySQL55-server-5.5.50-1.cp1156.x86_64
cpanel-mysql-libs-5.1.73-1.cp1156.x86_64
MySQL55-shared-5.5.50-1.cp1156.x86_64
MySQL55-test-5.5.50-1.cp1156.x86_64
compat-MySQL51-shared-5.1.73-1.cp1150.x86_64
cpanel-mysql-5.1.73-1.cp1156.x86_64

Also note that cPanel still uses MySQL 5.1 shared libraries.

So why did cPanel perform not one shutdown, but two in immediate succession? The first was 17 seconds, the second was 2 seconds. Not being experienced with cPanel I cannot offer an answer for this shutdown occurance. I can for others which I will detail later.

160609 17:04:24 [Note] /usr/sbin/mysqld: Normal shutdown
...
160609 17:04:28 [Note] /usr/sbin/mysqld: Shutdown complete
...
160609 17:04:41 [Note] /usr/sbin/mysqld: ready for connections.
...
160609 17:04:43 [Note] /usr/sbin/mysqld: Normal shutdown
...
160609 17:04:45 [Note] /usr/sbin/mysqld: ready for connections.
...

And why did the customer not know about the outage? If you use popular SaaS monitoring solutions such as New Relic and Pingdom you would not have been informed because these products have a sampling time of 60 seconds. I use these products along with Nagios on my personal blog site as they provide adequate instrumentation based on the frequency of usage. I would not recommend these tools as the only tools used in a high volume production system simply because of this one reason. In high volume system you need sampling are much finer granularity.

So just when you were going to justify that 17 seconds while unexpected is tolerable, I want to point out that this subsequently occurred and the outage was over 4 minutes.

160619 11:58:07 [Note] /usr/sbin/mysqld: Normal shutdown
...
160619 12:02:26 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
...

An analysis of the MySQL error log which is correctly not rolled as I always recommend showed a pattern of regular MySQL updates, from 5.5.37 thru 5.5.50. This is the most likely reason a 3rd party product has performed a database outage, to perform a software update at their choosing, not yours.

150316  3:54:11 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.37-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150316  3:54:22 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.37-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150316 19:07:31 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.37-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150317  2:05:45 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.40-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150317  2:05:54 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.40-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150319  1:17:26 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.42-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150319  1:17:34 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.42-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150616  1:39:44 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.42-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

150616  1:39:52 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.42-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

151006  1:01:45 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.45-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

151006  1:01:54 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.45-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

151027  1:21:12 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.46-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160105  1:31:35 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.47-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160211  1:52:47 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.48-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160211  1:52:55 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.48-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160503  1:14:59 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.49-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160503  1:15:03 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.49-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160521 18:46:24 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.49-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160522 11:51:45 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.49-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160529 15:26:41 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.49-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160529 15:30:12 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.49-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160604 23:29:15 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.49-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160609 17:04:41 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.50-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160609 17:04:45 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.50-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)

160619 12:21:58 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.50-cll'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MySQL Community Server (GPL)

What is intriguing from this analysis is that several versions were skipped including .38, .39, .41, .43, .44. One may ask the question why?

For Clients

This leads to several questions of the strategy used in your organization for controlling outages of your MySQL infrastructure for upgrades or for other reasons.

  • What is an acceptable outage time?
  • What is the acceptable maintenance window to perform outages?
  • What is your release cadence for MySQL upgrades?
  • Who or what performs updates?
  • Can your monitoring detect small outages?

You should also consider in your business strategy having a highly available (HA) MySQL infrastructure to avoid any outage, or application intelligence to support varying levels of data access as I describe in Successful MySQL Scalability Principles.