Simple steps to increasing site availability

A recent production database migration with a large client highlighted a fundamental flaw in their architecture with respect to site availability. While the development team had taken several good steps to improve the scalability of the site, there was a clear failure to understand and support different levels of data availability, a topic I cover in my presentation Successful Scalability Principles.

It was the decision of the development manager to shut down the entire site to perform a final DB migration. The downtime was only 60 seconds, but this approach was completely unnecessary, and during that time user requests were simply rejected without any explanation.

The Problem

The system had already been siloed/partitioned/sharded into 5 distinct sources of information. 4 of these data sources, in MySQL, had appropriate read and write capacity (i.e. MySQL replication), and the application configuration to support reading data from a source other than the primary data source. Both of these principles are good steps towards scalability and performance. What was lacking was availability.

The wrong way

The migration of the final partition involved moving from AWS RDS to AWS EC2 instances running MySQL. This final, all-important module managed advertisements, campaigns and ad tracking, and required that no data be lost.

In AWS, the approach taken was to remove approximately 60 webservers from the public load balancer (ELB). The result was that all requests, some 20,000 to 25,000 of them, simply hung or likely produced an HTTP 500 error.

This was the first fundamental flaw. What does your website look like when it is unavailable? In this case this was never considered or planned for. At worst, all sites should have an emergency “site unavailable due to maintenance” page, trivially managed by a second virtual host in your Apache web server configuration. This can be enabled with zero downtime. While you are inconveniencing the end user, you are also informing them, and they will be more receptive to proactive information.
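The suggestion above is an Apache virtual host. As a rough illustration of the same idea in application code, here is a minimal Python WSGI sketch that answers with an explanatory HTTP 503 rather than letting requests hang. The names `with_maintenance`, `in_maintenance` and `site`, the message text and the 120 second Retry-After value are all assumptions for illustration, not part of the client's system.

```python
MAINTENANCE_BODY = b"Site unavailable due to maintenance. Please try again shortly.\n"


def with_maintenance(app, in_maintenance):
    """Wrap a WSGI app so it explains downtime instead of hanging or erroring."""
    def wrapper(environ, start_response):
        if in_maintenance():
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(MAINTENANCE_BODY))),
                ("Retry-After", "120"),  # hint, in seconds, for when to retry
            ])
            return [MAINTENANCE_BODY]
        return app(environ, start_response)
    return wrapper


def site(environ, start_response):
    """Hypothetical stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"normal page\n"]


# The flag is checked on every request, so it can be flipped without a restart.
maintenance = {"on": False}
app = with_maintenance(site, lambda: maintenance["on"])
```

Setting `maintenance["on"] = True` switches every subsequent response to the maintenance page; setting it back restores normal service.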

The second fundamental flaw is that the unavailability of one part of the system should not affect the entire system if there is no interaction. There were 5 distinct and standalone partitions; only 1 required downtime.

The Right Way

In this situation there was more than one approach to minimize downtime while switching data sources and to ensure all data was captured.

Most sites fail at the fundamental principle of supporting different levels of data availability. In this specific case, one partition (i.e. 1/5 of the data) would be unavailable. Why should that situation affect 100% of your website? Furthermore, only the ability to write was affected. Why then should that affect the ability to read ads?

There are at least four types of data availability: the ability to write data, to read data, to read cached data, and no data access at all. There are also more fine-grained methods, one of which I will discuss.

Defining your data availability requires your application to support and manage data access. This is not easy if your application was not developed with this in mind. I will give you a simple example. Many popular LAMP frameworks, including Drupal & WordPress, were never designed for read scalability. They relied on a single MySQL server. The act of scaling reads, and providing a read-only site, is an afterthought, and many websites struggle to find creative ways to support this primary architectural design pattern.

Knowing whether a user request requires the ability to read and/or write data is the first key step. Knowing what type of data is the second. Providing a way to communicate which levels of data access are currently available, together with the ability to turn off features while maintaining site uptime, is critical for improving site availability.
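As a minimal sketch of these ideas, the four levels can be modelled as an ordered value per partition, with each feature declaring the partition it touches and the minimum level it needs. The ordering (write implies read, which implies cached read), the partition name `accounts` and the feature names are assumptions made for illustration; only the ads partition comes from the case described here.

```python
from enum import IntEnum


class Access(IntEnum):
    """Data availability levels, lowest to highest (ordering is an assumption)."""
    NONE = 0
    CACHED_READ = 1
    READ = 2
    WRITE = 3


# Current level per partition. During the migration only "ads" loses writes.
current_access = {
    "ads": Access.READ,
    "accounts": Access.WRITE,   # hypothetical second partition, unaffected
}

# Each feature names its partition and the minimum level it requires.
FEATURES = {
    "show_ad": ("ads", Access.CACHED_READ),
    "edit_campaign": ("ads", Access.WRITE),
    "login": ("accounts", Access.READ),
}


def feature_enabled(name):
    """Return True if the feature's partition currently offers enough access."""
    partition, needed = FEATURES[name]
    return current_access[partition] >= needed
```

With the state shown, `feature_enabled("show_ad")` and `feature_enabled("login")` are true while `feature_enabled("edit_campaign")` is false, so one feature is switched off rather than the whole site.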

More advanced approaches then consider the role of caching data. Generally sites will use caching to assist in reads, but caching can also be implemented to support non-critical writes. In this particular example, a write to cache presented a small but tangible risk of data loss. The solution was to implement a secondary logging strategy: a separate persistent write capability during the downtime, and the ability to replay. By limiting the writes to log-only (i.e. write once) operations, it became very simple to migrate from one system to a second system, logging and reapplying all data changes, and ensuring no site downtime and no data loss.
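The log and replay idea can be sketched roughly as follows, assuming a JSON-lines file as the persistent log and plain Python lists standing in for the old and new databases. The class names `ReplayLog` and `AdTracker` and the event fields are hypothetical; the actual implementation used for the client is not described here.

```python
import json
import os


class ReplayLog:
    """Append-only, write-once log of events, persisted as JSON lines."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the event to disk before returning

    def replay(self, apply):
        """Call apply(event) for each logged event, in order; return the count."""
        if not os.path.exists(self.path):
            return 0
        applied = 0
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                apply(json.loads(line))
                applied += 1
        return applied


class AdTracker:
    """Writes to the store normally, and to the log while migrating."""

    def __init__(self, store, log):
        self.store = store        # a list stands in for the database
        self.log = log
        self.migrating = False

    def record(self, event):
        if self.migrating:
            self.log.append(event)
        else:
            self.store.append(event)

    def finish_migration(self, new_store):
        """Point at the new store, then reapply everything logged meanwhile."""
        self.store = new_store
        self.migrating = False
        return self.log.replay(new_store.append)
```

A production version would also need to clear or rotate the log after a successful replay and cope with writes arriving during the replay itself; those details are left out of this sketch.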

Conclusion

Managing site availability comes back to one very important task: clearly define your uptime needs.

Performance v Scalability – For Employers

In a recent discussion with a peer who was reviewing a job description he was applying for, we got into the specifics of a Performance Engineer versus a Scalability Engineer.

Performance and Scalability are two very different goals. While it is true that improving performance can lead to increased scalability capacity with the same physical resources, increasing the scalability of your application does not necessarily lead to improved performance.

Performance is all about perception. In layman’s terms, how quickly can you provide a response to a request from your customer? As volume increases, performance generally degrades after a certain point, and then as volume continues to grow, the outcome is often complete failure. Having a suitably scalable architecture can enable you to provide consistent performance for a given and growing workload.

A Scalability Engineer needs to have architectural skills, management skills, deployment skills and automation skills. A Performance Engineer needs to have more specific technology skills, development skills and some architectural skills.

A great example of a performance problem is when a client contacts me to help with a slow website. When the home page takes 5 seconds to load, but only 500ms of that is the actual page generation, which is also the maximum possible amount of time spent in the database, then in isolation as a database expert I could only improve on 10% of the actual problem. As a performance engineer, your knowledge of the full stack, including the web container, data store access (persistent and non-persistent), optimizing the network payload size with compression, various caching techniques and the capacity for parallelism, is essential.
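The arithmetic behind that 10% figure can be checked with a small sketch. The function name and the speedup values are illustrative assumptions; the 5 second and 500ms figures are the ones from the example above.

```python
import math


def load_time_after_speedup(total_s, component_s, speedup):
    """Page load time if only one component is made `speedup` times faster."""
    return total_s - component_s + component_s / speedup


total = 5.0       # seconds for the home page to load
database = 0.5    # upper bound on time spent in the database

share = database / total                                       # 0.1, i.e. 10%
twice_as_fast = load_time_after_speedup(total, database, 2)    # 4.75 seconds
best_case = load_time_after_speedup(total, database, math.inf) # 4.5 seconds
```

Even if the database time dropped to zero, the page would still take 4.5 seconds, which is why the rest of the stack has to be examined.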

A scalability problem is when your site supports 5,000 concurrent users, but it needs to support 25,000. Applying the primary skills just listed will not solve your scalability need. Simply adding 5x the servers is a simple way to provide support for more concurrent users, but where is the bottleneck or limitation of your application as you scale? Does adding 5x web servers place too much load on your caching tier or your database tier? While most applications utilize load balancing for web traffic, and so adding a new webserver is generally straightforward (to a point), can your application even support adding more database servers? Or does your architecture lead to read scalability, but not write scalability? Not being able to scale writes is a clear single point of failure for scalability. Most scalability needs require (re)architecture of your stack and the management of how this can be achieved while maintaining an operational site. After a point when you have 500+ servers, adding 50 more servers is generally the role of great automated deployment processes. The problem is usually greater when moving from 5 servers to 25 servers.

For employers that are writing a job description and using a specific job title, consider whether the objectives in the description match the title.

This leads to the question, what about a Reliability Engineer? That is another detailed discussion that relates to performance and scalability, but also has very different goals. Clearly defining your uptime needs is just one question a reliability engineer needs to ask.

Clearly define your uptime needs

In writing about Performance and Scalability I referenced a quote that I have provided in a number of presentations regarding a valuable interaction with a client. All software architects and managers need to clearly understand this for their own sites in order to enable technical resources to deliver a highly scalable solution.

Development Manager: We need a maintenance window for software upgrades and new releases.
CTO: No Downtime.
Development Manager: But we need this to fix problems and improve performance.
CTO: No Downtime.
Consultant (aka Ronald Bradford):  Mr CTO. What is your definition of no downtime?
CTO:  We serve pages, we serve ads.
Consultant: We can do that.

Asking the right question about the uptime requirements completely changed the architecture needed to meet these specific high availability needs.

It is important to note that, with this major TV network client, the answer was not updating content, selling merchandise or enabling customers to comment. Each of these needs requires a different approach to high availability.