In the press in the last two days has been the reported outage of Amazon Web Services Elastic Compute Cloud (EC2) in just one North Virginia data center. This has affected many large website includes FourSquare, Hootsuite, Reddit and Quora. A detailed list can be found at ec2disabled.com.
For these popular websites was this avoidable? Absolutely.
Basic scalability principles if deployed in these systems architecture would have averted the significant downtime regardless of your development stack. While I work primarily in MySQL these principles are not new, nor are they complicated, however they are fundamental concepts in scalability that apply to any technology including the popular MongoDB that is being used by a number of affected sites.
Scalability 101 involves some simple basic rules. Here are just two that seem to have been ignored by many affected by this recent AWS EC2 outage.
- Never put all your eggs in one basket. If you rely on AWS completely, or you rely on just one availability zone that is putting all your eggs in one basket.
- Always keep your important data close to home. When it comes to what is most critical to your business you need access and control to your information. At 5am in the morning when the CEO asks how long will our business be unavailabla and what is needed to resolve the problem, the answer “We have no control over this and have no ETA” is not an acceptable answer.
With a successful implementation and appropriate data redundancy you may not have an environment immediately available however you have access to your important information and the ability to create one quickly. Many large hosting companies can provide additional H/W on near demand, especially if you have an initial minimal footprint. Indeed using Amazon Web Services (AWS) as a means to avert a data center disaster is an ideal implementation of Infrastructure As A Service (IAAS). Even with this issue, organizations that had planned for this type of outage could have easily migrated to another AWS availability zone that was unaffected.
Furthermore, system architecture to support various levels of data availability and scalability ensure you can handle many more various types of unavailability without significant system down time as recently seen. There are many different types of availability and unavailability, know what your definition of downtime is and supporting disasters should be your primary focus of scalability, not an after thought.
As an expert in performance and scalability I can help your organization in the design of a suitable architecture to support successful scalability and disaster. This is not rocket science however many organizations gamble without the expertise of a professional to ensure business viability.