MySQL Monitoring – What's really needed

The implementation of MySQL Monitoring is critical for any organization that uses a database and wants to avoid the inevitable disaster. There are 3 important components that all serve a key purpose to “MySQL Monitoring” in general:

Monitoring – Historical and graphical information
Alerting – Tell me when something is wrong
Dashboard – The State of NOW

Monitoring

There is no one option for Monitoring that is significantly better then another. A short list of what’s on offer can be found at http://monitoring-mysql.com/monitoring-products. What’s important is you have monitoring in place so historically you can review situations and compare across your servers and enabling the better identification of physical or database bottlenecks. My recommendations for products are Cacti which is packaged with most popular Linux distrubtions, so can be installed via a single apt-get/yum/rpm command, and the MySQL Cacti Templates. This is not the best product solution or combination, it’s in my opinion the most common and covers all the essential bases.

It’s best to define a different web server (i.e. publically accessible) to be the monitoring server rather then installing the web interface on any single DB server.

Alerting

Alerting is key to notify you of a problem without the need for somebody to be viewing a screen and see it happening via your monitoring. The identification of high CPU load, a disk nearing capacity, database locking etc often helps avoid a current problem before it becomes some level of disaster. Almost all companies use Nagios or a derivative such as the main fork Icigna or products that include Nagios like Opsview or Groundworks.

Dashboard

While both Monitoring and Alerting are necessary, they both however lack a key component necessary for successful administration. That is timing. Both of these earlier options sample, e.g. 1 minute or generally 5 minutes (by default), but problems can happen quickly. This is why each organization needs to have a Dashboard. I don’t know of any products here unless you try and adjust a monitoring product, but want’s needed is a very lightweight and very business centric single page of Green/Yellow/Red status’s of your environment including databases, webservers, response time and traffic etc. This is for the state of NOW. A Dashboard should sample every 5 to 10 seconds. I have seen larger and more successful companies have various home grown implementations. I developed a product for one company and it included the following on a single page. You can see the screen output in a presentation at http://www.slideshare.net/ronaldbradford/10x-performance-improvements. This included

5 DB servers monitoring load average, ping time, database connections, active,free,locked, replication availability and lag
5 web servers monitoring load average, ping time, apache connections
Application metrics monitoring 3 different page load times, and page size

It is often important to be able to identify a key problem and then drill down to this more quickly rather then the usual “the website is slow” question and having to investigate the same repetitive tasks. You need to automate, and be more pro-active in response especially to load and locking issues.

Advanced Monitoring

Information is only have the requirement, it is what you do with this information that determines how to be proactive rather then reactive. If I was the DBA of a company I’d do even more then these initial 3 steps which are a necessary base. I would also monitor for example:

Database size and growth. This is important to be preemptive about your capacity. Example SQL at http://ronaldbradford.com/mysql-dba/#allschemas
Error Log changes
Backup timing. This is important as your DB grows as it affects recovery.
Recovery timing
Gather raw MySQL status information because monitoring tools only capture what you ask it to do, not everything. While you may not analyze all now, you may want to in the future look back in time. Example scripts at http://ronaldbradford.com/mysql-dba/#log-stats
Hourly/Daily text reports. Producing a easy readable SHOW GLOBAL status report such as with statpack will for example enable me to know network throughtput in the DB, transaction throughput and key indicators of locking, disk access etc. While you may have a graphic interface, it’s a lot easier to automate and grep text reports.
Proactive restrictions. The Twitter failed whale is a great example of when the system moves closer to known limits, but before those limits they start limiting load. This includes for example to disable less critical but resource intensive functionality, e.g. people search. The also start rejecting connections so they do not reach a crash state. This could be proactively changing timeout values so the DB fails queries, and the webservers respond accordingly with a try again approach.

While these are important if you have only limited resources, too much information can be just as much of a burden then people just start ignoring the information and miss what’s important.