MySQL Monitoring 101 – Graph your results

The problem

Hands up those who don’t monitor their production MySQL web server. I’m always a little surprised by how many hands go up: I’ve visited several clients that have absolutely no monitoring other than “the customers will tell us when something is wrong”. The lack of system monitoring is one of the topics in my book “101 ways to screw up a successful startup”.

Why is monitoring important? First, it can tell you when something is wrong: most monitoring systems let you define thresholds that trigger notifications via email, SMS or red flashing screens. Second, and in my view more important, it allows you to analyze change and compare results over time. Let’s say you added more memory to your server, and then remembered to also increase the MySQL buffers appropriately. How much improvement did it make? Rather than “it seems faster”, you have hard and fast numbers to back it up.
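As a quick sketch of that before/after comparison, you can record the relevant buffer size around the change and line it up against your monitored numbers (the specific variable here is just one example of a buffer you might have tuned):

# Record the current InnoDB buffer pool size (reported in bytes)
# before and after the change, so the graphs have context.
mysql -u root -p -e "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size'"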

The Monitoring

If you have zero monitoring, you need to implement at least the following. Create the script below and run it daily at midnight via cron. I’ve made the script as simple as it can be.

#!/bin/sh
# Collect OS-level statistics every 5 seconds for 24 hours
# (17280 samples x 5 seconds = 86400 seconds = 1 day).
DATETIME=`date +%Y%m%d.%H%M`
HOSTNAME=`hostname -s`
LOG_DIR="/tmp"

# Run both collectors in the background so the script returns immediately.
vmstat 5 17280 > $LOG_DIR/os.vmstat.5.$HOSTNAME.$DATETIME.log &
iostat -x 5 17280 > $LOG_DIR/os.iostat.5.$HOSTNAME.$DATETIME.log &
exit 0
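A sample crontab entry to schedule it, assuming the script is saved as /usr/local/bin/monitor.sh (the path is purely illustrative):

# m h dom mon dow  command -- run the collectors at midnight every day
0 0 * * * /usr/local/bin/monitor.sh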

If you don’t have iostat installed, you will find iostat and sar in the sysstat package, which most distributions make generally available.
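For example, on Debian/Ubuntu or Red Hat style systems respectively:

sudo apt-get install sysstat   # Debian/Ubuntu
sudo yum install sysstat       # Red Hat/CentOS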

The results

vmstat output of a relatively idle system.

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 304332 289484 435052    0    0     3     4  206  177  1  0 98  0
 0  0      0 304108 289496 435072    0    0     0    45  411  736  5  0 95  0
 0  0      0 304116 289500 435072    0    0     0    13  395  716  3  0 97  0

vmstat of the same system under disk load. Note the sustained bo (blocks written out) and wa (CPU time waiting on I/O) columns.

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 234784 289844 516800    0    0     3     4  206  177  1  0 98  0
 1  1      0 101636 289988 645552    0    0     1 33190  448  626 35  5 31 29
 1  2      0  33092 280888 723276    0    0     4 17233  469  616 24  4  6 66
 1  0      0  62876 177444 797056    0    0     2 14846  837  938 33  5 31 30
 1  1      0  33252 168440 834064    0    0     1 30376  969  904 31  6 22 41

iostat of the system under load. Watch the wsec/s, avgqu-sz and %util columns as the disk saturates.

$ iostat -x 5
Linux 2.6.22-14-generic (newyork)       02/16/2008

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.31    0.00    0.29    0.15    0.00   98.26

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.83     5.82    0.71    0.94    13.04    53.56    40.53     0.09   52.60   2.08   0.34
sr0               0.00     0.00    0.00    0.00     0.00     0.00    10.40     0.00  237.00 235.00   0.01

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          41.30    0.00    5.70   29.00    0.00   24.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  5153.60    0.60   75.80     4.80 42774.40   559.94    81.57 1354.93  13.09 100.00
sr0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.63    0.00    4.10   61.36    0.00    3.90

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  3850.00    0.20   60.00     1.60 39763.20   660.54    83.68 1450.90  13.86  83.44
sr0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.10    0.00    3.80   42.30    0.00   31.80

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  7645.80    0.40   66.80     3.20 44009.60   654.95   100.34 1192.33  14.14  95.04
sr0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

The analysis

The graph below shows CPU usage over a few hours. Only via the graph was it apparent that a spike of CPU occurred every 5 minutes. In fact it occurred on just one of several servers, and this enabled analysis that found a rogue cron job running every 5 minutes.

The graph below shows CPU idle over a 24 hour period. The colors represent different usage patterns, such as daytime traffic, overnight batch processing and so on.

The graph below shows I/O utilization over the same 24 hour period.

NOTE: Colors were added manually to highlight points.
These past two graphs highlight a number of things that require further analysis, but each shows only part of a bigger picture. For example, only CPU idle is graphed, not I/O wait, which as you can see from the iostat output is excessive at times.

Still, these graphs were created by grep’ing the appropriate data into a CSV file, loading it into Open Office, and clicking the graph button: about a 60 second process for each type of data. That alone gave a better picture of what was happening than no picture at all.
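As a minimal sketch of that extraction step (the log file name below is illustrative, following the naming scheme from the collection script), something like this pulls the four CPU columns out of a vmstat log into a CSV file ready for a spreadsheet:

#!/bin/sh
# Extract the CPU columns (us, sy, id, wa) from a vmstat log into CSV.
# Keeping only rows whose first field is numeric skips the header lines.
LOG=/tmp/os.vmstat.5.newyork.20080216.0000.log
echo "us,sy,id,wa" > cpu.csv
awk '$1 ~ /^[0-9]+$/ { print $13 "," $14 "," $15 "," $16 }' $LOG >> cpu.csv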

What’s next

This level of manual monitoring is impractical for production systems, especially if you are monitoring more than one system. You need to monitor more things, compare different results at the same time, and have better results at hand instantly. In my next topic, “Monitoring 201”, I’ll talk about a number of easy and practical options.