Posts Tagged ‘disaster’

Basic scalability principles to avert downtime

Saturday, April 23rd, 2011

In the press in the last two days has been the reported outage of Amazon Web Services Elastic Compute Cloud (EC2) in just one North Virginia data center. This has affected many large website includes FourSquare, Hootsuite, Reddit and Quora. A detailed list can be found at ec2disabled.com.

For these popular websites was this avoidable? Absolutely.

Basic scalability principles if deployed in these systems architecture would have averted the significant downtime regardless of your development stack. While I work primarily in MySQL these principles are not new, nor are they complicated, however they are fundamental concepts in scalability that apply to any technology including the popular MongoDB that is being used by a number of affected sites.

Scalability 101 involves some simple basic rules. Here are just two that seem to have been ignored by many affected by this recent AWS EC2 outage.

  1. Never put all your eggs in one basket. If you rely on AWS completely, or you rely on just one availability zone that is putting all your eggs in one basket.
  2. Always keep your important data close to home. When it comes to what is most critical to your business you need access and control to your information. At 5am in the morning when the CEO asks how long will our business be unavailabla and what is needed to resolve the problem, the answer “We have no control over this and have no ETA” is not an acceptable answer.

With a successful implementation and appropriate data redundancy you may not have an environment immediately available however you have access to your important information and the ability to create one quickly. Many large hosting companies can provide additional H/W on near demand, especially if you have an initial minimal footprint. Indeed using Amazon Web Services (AWS) as a means to avert a data center disaster is an ideal implementation of Infrastructure As A Service (IAAS). Even with this issue, organizations that had planned for this type of outage could have easily migrated to another AWS availability zone that was unaffected.

Furthermore, system architecture to support various levels of data availability and scalability ensure you can handle many more various types of unavailability without significant system down time as recently seen. There are many different types of availability and unavailability, know what your definition of downtime is and supporting disasters should be your primary focus of scalability, not an after thought.

As an expert in performance and scalability I can help your organization in the design of a suitable architecture to support successful scalability and disaster. This is not rocket science however many organizations gamble without the expertise of a professional to ensure business viability.

Checked your MySQL recovery process recently?

Monday, February 15th, 2010

I sound like a broken record with every client when I talk to about the resilience of their production environments. It’s very simple in theory, however in practice many organizations fail.

Ask yourself these checklist questions for your MySQL backup and recovery process?

  1. Do you have MySQL backups in place?
  2. Do you backup ALL your MySQL data?
  3. Do you have consistent MySQL backups?
  4. Do you have backups that include both static snapshot and point in time transactions?
  5. Do you review your backup logs EVERY SINGLE day or have tested backup monitoring in place?
  6. Do you perform a test recovery of your static backup?
  7. Do you perform a test recovery to point in time?
  8. Do you time your backup and recovery process and review over time?
  9. Do you have off-site copies of your backups?
  10. Do you backup your primary binary logs?

In the past month I’ve discovered clients that have an online only business (i.e. Their MySQL data is their only tangible asset), they perform daily backups but they don’t have binary logging enabled. I’ve also discovered an example of backup logs not being checked, and an underlying mysqldump error was resulting in an incomplete backup, yet the backup script apparently completed successfully.

Disaster is inevitable.

If you don’t score 8 or better in the above checklist in your business, you are at higher risk. If you are a owner/founder/executive this should keep you awake at night if your not sure of your business viability. If your organization needs help, please contact me for assistance.

  • Can you recover from a small or large disaster?
  • Do you have confidence in your DR plan?
  • Do you know how long your DR plan will take.
  • What does your online business look like or operate during your DR time?

Calculating your database size

Friday, September 25th, 2009

I generally use the following MySQL INFORMATION_SCHEMA (I_S) query to Calculate Your MySQL Database Size. This query and most others that access the MySQL INFORMATION_SCHEMA can be very slow to execute because they are not real tables and are not governed by physical data, memory buffers and indexes for example but rather internal MySQL data structures.

Mark Leith indicates in his post on innodb_stats_on_metadata that Innodb performs 8 random(ish) dives in to the index, when anybody accesses any of SHOW TABLE STATUS, SHOW INDEX, INFORMATION_SCHEMA.TABLES,INFORMATION_SCHEMA.STATISTICS for InnoDB tables. This can have an effect on performance, especially with a large number of Innodb tables, and a poor ratio of innodb_buffer_pool_size to disk data+index footprint.

What is even more incredible is when the result of this apparently harmless query causes the mysqld process to actual crash with a core dump due to these random index dives. The following core dump analysis highlights my query as the cause of the problem. This has happened now at least twice in for recent core crashes on a production environment.

(gdb) bt
#0 0x000000327280b6b2 in pthread_kill () from ./lib64/libpthread.so.0
#1 0x000000000055b136 in handle_segfault ()
#2 
#3 0x00000000007e1c21 in rec_get_offsets_func ()
#4 0x0000000000766007 in btr_estimate_number_of_different_key_vals ()
#5 0x000000000070d4c2 in dict_update_statistics_low ()
#6 0x000000000061fa84 in ha_innobase::info ()
#7 0x0000000000636972 in fill_schema_charsets ()
#8 0x0000000000639a66 in get_all_tables ()
#9 0x0000000000634633 in get_schema_tables_result ()
#10 0x00000000005bde37 in JOIN::exec ()
#11 0x00000000005bf7a7 in mysql_select ()
#12 0x00000000005c0127 in handle_select ()
#13 0x000000000056fcf0 in mysql_execute_command ()
#14 0x0000000000574c83 in mysql_parse ()
#15 0x00000000005751a0 in dispatch_command ()
#16 0x0000000000576483 in do_command ()
#17 0x0000000000577002 in handle_one_connection ()
#18 0x0000003272806367 in start_thread () from ./lib64/libpthread.so.0
#19 0x0000003271cd30ad in clone () from ./lib64/libc.so.6
Cannot access memory at address 0x3271cd3040 

This is an information_schema query that caused innodb to open a table.
This is totally normal. On first open, innodb tables get automatically  analyzed.
This analyze process crashed in innodb. 

This exact query *provoked* a crash: 

(gdb) x/1s 0x00002aaabc961dd0
0x2aaabc961dd0: "SELECT table_schema,table_name,engine,row_format,
table_rows, avg_row_length,
(data_length+index_length)/1024/1024 as total_mb,
(data_length)/1024/1024 as data_mb,
(index_length)/1024/1024 as index_mb,
CURDATE() AS today
FROM information_schema.tables
WHERE table_schema=@schema
ORDER BY 7 DESC"

The issue however is which table is the problem? How widespread is the corruption. Would an ALTER TABLE ENGINE=Innodb rebuild the table and eliminate the problem. Would an ANALYZE on an Innodb table identify the problem? (I doubt this second point). The problem however is even more significant due to the actual system. The largest single table of this 1TB database is 500GB. The impact of performing the ALTER, the time to undertake this blocking operation, the increase in the Innodb data file that can’t be reclaimed are just two factors that the inexperienced may fall victim of.

A saying I use is “Disaster is inevitable”. In this situation the disaster appears to not be significant but the ramifications due to the lack of appropriate and expert architectural design considerations to correct the problem are.

Is your environment capable of supporting this maintenance requirement? If not, then is the decision maker in your organization worried enough to seek the expert advice to address pro actively or will it be too late.

Never let your binlog directory fill up

Wednesday, July 15th, 2009

Recently with a client while running a number of disaster recovery tests I came across a nasty situation which was not part of the original plan and provided a far worse disaster situation then expected.

I should preface this condition with some of the environment conditions.

  • MySQL 5.0 Enterprise 5.0.54
  • RHEL 5 64bit
  • Master and 2 Slaves
  • MySQL Data and MySQL Binary Logs/MySQL Error Logs are on separate disk partitions

While running stress tests under high load, we tested the filling of partition containing the logs. This partition included the binary log and MySQL error log.

The observed output was.

  • An error message was written to the MySQL error log. See below.
  • Application throughput dropped, but did not stop.
  • Binary logs stopped occurring.
  • MySQL proactively stopped logging but continued to process transactions.

The end result was:

  • The error log was ultimately truncated after reboot, so if the information was not captured while this was in action, this important messages would be lost.
  • The primary slave used for read scalability and the secondary slave used for backups are now unusable.
  • The backup and recovery process using slaves and point in time recovery via binary logs is not unusable.
  • The three backup methods in use for the client are ineffective. It was necessary to disable access to the Master, take a full backup, and then sync the slaves from this copy.
090710 19:01:25 [ERROR] /opt/mysql/bin/mysqld: Disk is full writing '/mysqllog/binlog/hostname-3306-bin.000020'
     (Errcode: 28). Waiting for someone to free space... Retry in 60 secs
090710 19:01:46 [ERROR] Error writing file '/mysqllog/slow_log/hostname_3306_slow_queries.log' (errno: 1)
090710 19:02:25 [ERROR] Error writing file '/mysqllog/binlog/hostname-3306-bin' (errno: 28)
090710 19:02:25 [ERROR] Could not use /mysqllog/binlog/hostname-3306-bin for logging (error 28).
    Turning logging off for the whole duration of the MySQL server process. 
    To turn it on again: fix the cause, shutdown the MySQL server and restart it.

Updated

I discuss in detail the options for the MySQL error log including recommendations for the MySQL error log file location in Monitoring MySQL – The error log