The problem
Hands up those that don’t monitor their production MySQL web server. It still surprises me, but I’ve visited several clients who have absolutely no monitoring other than “the customers will tell us when something is wrong”. The lack of system monitoring is one of the topics in my book “101 ways to screw up a successful startup”.
Why is monitoring important? First, it can tell you when something is wrong: most monitoring systems provide some level of thresholds that trigger notifications via email, SMS or red flashing screens. Second, and what I consider more important, it allows you to analyze change and compare results over time. Let’s say you added more memory to your server, and then remembered to also increase the MySQL buffers appropriately. How much improvement did it make? Rather than “it seems faster”, you can have hard and fast numbers to back it up.
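As a quick illustration of the buffers in question, the relevant my.cnf settings might look like the following. The values here are illustrative only; size them to your workload and available memory.

[mysqld]
# Illustrative values only -- size these to your workload and RAM
key_buffer_size         = 512M   # MyISAM index cache
innodb_buffer_pool_size = 2G     # InnoDB data and index cache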
The Monitoring
If you have zero monitoring, you need to implement at least the following. Create the following script and run it daily at midnight via cron. I’ve made the script as simple as possible.
#!/bin/sh
SCRIPT_NAME="monitor"
DATETIME=`date +%Y%m%d.%H%M`
HOSTNAME=`hostname -s`
LOG_DIR="/tmp"
# 17280 five-second samples = 86400 seconds = 24 hours of data
vmstat 5 17280 > $LOG_DIR/os.vmstat.5.$HOSTNAME.$DATETIME.log &
iostat -x 5 17280 > $LOG_DIR/os.iostat.5.$HOSTNAME.$DATETIME.log &
exit 0
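To have cron launch this at midnight each day, the matching crontab entry might look like this (the script path here is an assumption; use wherever you saved it):

# min hour dom mon dow  command
0 0 * * * /usr/local/bin/monitor.sh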
If you don’t have iostat installed, you will find iostat and sar as part of the sysstat package, which most distributions make generally available.
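Installing it is typically a one-line operation, for example:

# Debian/Ubuntu
sudo apt-get install sysstat
# Red Hat/CentOS
sudo yum install sysstat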
The results
vmstat output of a relatively idle system.
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 304332 289484 435052    0    0     3     4  206  177  1  0 98  0
 0  0      0 304108 289496 435072    0    0     0    45  411  736  5  0 95  0
 0  0      0 304116 289500 435072    0    0     0    13  395  716  3  0 97  0
vmstat of a system under disk load.
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 234784 289844 516800    0    0     3     4  206  177  1  0 98  0
 1  1      0 101636 289988 645552    0    0     1 33190  448  626 35  5 31 29
 1  2      0  33092 280888 723276    0    0     4 17233  469  616 24  4  6 66
 1  0      0  62876 177444 797056    0    0     2 14846  837  938 33  5 31 30
 1  1      0  33252 168440 834064    0    0     1 30376  969  904 31  6 22 41
iostat of a system under load.
$ iostat -x 5
Linux 2.6.22-14-generic (newyork)    02/16/2008

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.31    0.00    0.29    0.15    0.00   98.26

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.83    5.82  0.71  0.94   13.04    53.56    40.53     0.09   52.60   2.08   0.34
sr0        0.00    0.00  0.00  0.00    0.00     0.00    10.40     0.00  237.00 235.00   0.01

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          41.30    0.00    5.70   29.00    0.00   24.00

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00 5153.60  0.60 75.80    4.80 42774.40   559.94    81.57 1354.93  13.09 100.00
sr0        0.00    0.00  0.00  0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.63    0.00    4.10   61.36    0.00    3.90

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00 3850.00  0.20 60.00    1.60 39763.20   660.54    83.68 1450.90  13.86  83.44
sr0        0.00    0.00  0.00  0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.10    0.00    3.80   42.30    0.00   31.80

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00 7645.80  0.40 66.80    3.20 44009.60   654.95   100.34 1192.33  14.14  95.04
sr0        0.00    0.00  0.00  0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
The analysis
The below graph shows CPU usage over a few hours. It was only via the graph that a CPU spike occurring every 5 minutes was observed. In fact it occurred on just one of several servers, and this enabled analysis to find a rogue cron job running every 5 minutes.
The below graph shows CPU idle over a 24 hour period. The colors represent different usage patterns, such as daytime traffic and overnight batch processing.
The below graph shows I/O utilization over the same 24 hour period.
NOTE: Colors added manually to highlight points.
These past two graphs highlight a number of things that require further analysis, but they show only part of a bigger picture. For example, only CPU idle is graphed, not wait I/O, which as you can see from the iostat output is excessive at times.
Still, these graphs were created by grep’ing the appropriate data into a CSV file, loading it into OpenOffice, and clicking the graph button, about a 60 second process for each type of data. This alone gave a better picture of what was happening than no picture at all.
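For the curious, a rough sketch of that extraction step, assuming the log names produced by the monitoring script above (substitute your own hostname and datetime), with field numbers matching the vmstat and iostat output shown earlier:

#!/bin/sh
# vmstat header lines contain letters, data lines only digits and
# spaces, so grep the headers away. Fields 15 and 16 are id and wa.
grep -v '[a-z]' /tmp/os.vmstat.5.newyork.20080216.0000.log | \
  awk '{ print NR "," $15 "," $16 }' > cpu.csv

# Likewise, pull %util (field 12) for the sda device from iostat.
grep '^sda' /tmp/os.iostat.5.newyork.20080216.0000.log | \
  awk '{ print NR "," $12 }' > io_util.csv

Each row of the resulting CSV is one 5-second sample, ready to load and graph.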
What’s next
This level of monitoring is impractical for production systems, especially if you are monitoring more than one system. You need to monitor more things, compare different results at the same time, and have better results at hand instantly. In my next topic, “Monitoring 201”, I’ll talk about a number of easy and practical options.