A reliable and dependable application requires observability

Observability (o11y) is a critical pre-requisite component in software architecture when advocating for and preparing organizations for making informed decisions on the success of their application. Open Telemetry from the Cloud Native Computing Foundation is the goto standard regardless of your choices of monitoring tools. However, observability is just a building block that I need to explain when advocating for having reliable and dependable systems. Observability will not inform you “Was the customer actually impacted, or how many, or how long?”. Observability will not tell you “the root cause of a problem?”.

My five layers of building blocks for Reliability are:

  1. Observability – The collection of telemetry (metrics, traces and logs) should just be there. If you are using Kubernetes (k8s) and a Java/Python/Node.js application it is already built in. Just do it.
  2. Reproducibility – The ability with a known set of steps and a given configuration and setup you can reproduce an outcome showing the same observed results is a necessary pre-requisite for any feature development or bug fixes.
  3. Testability – After being able to consistently reproduce an observed event with measurable results, the running of various experiments using a variety of changes enables you to adequately test future improvements or corrections to the initial situation, whether it’s a known bug, or a new piece of functionality. Reproducible and consistent testing is an essential component to the release of software for a reliable application.
  4. Scalability – It is impossible to adequately test a system to failure without an observable, reproducible, and testable framework. Many organizations suffer from the management “Can we support X operations” syndrome, when instead the application should know what “X” is automatically, and have adequate safeguards in place to prevent its occurrence. The ability to proactively disable [expensive] features for the good of the entire system is not a common practice for software (aka a dark mode). In fact, many organizations do not even have the capability of customer-level and individual component-level feature flags or related rate limits that can manually be implemented.
  5. Dependability – A reliable, highly available, and dependable application requires all of the prior layers to be in place to give a level of assurance to your customers and your company that your product is dependable.

Performance v Scalability – For Employers

In a recent discussion with a fellow peer reviewing a job description he was applying for, we got into a discussion on the specifics of a Performance Engineer verses a Scalability Engineer.

Performance and Scalability are two very different goals. While it is true that improving performance can lead to increased scalability capacity with the same physical resources, increasing the scalability of your application does not necessarily lead to improved performance.

Performance is all about perception. In layman’s terms, how quickly can you provide a response to a request from your customer. As volume increases, performance generally degrades after a certain point, and then as volume continues, often the outcome is complete failure. Having a suitable scalable architecture can enable you to provide consistent performance for a given and growing workload.

A Scalability Engineer needs to have architectural skills, management skills, deployment skills and automation skills. A Performance Engineer needs to have more specific technology skills, development skills and some architectural skills.

A great example of a performance problem is when a client contacts me to help with a slow performing website. When the home page takes 5 seconds to load, but only 500ms of that is the actual page generation, and ultimately the maximum possible amount of time spent in the database, in isolation as a database expert I could only improve on 10% of the actual problem. As a performance engineer, your knowledge of the full stack including the web container, the data store accesses (persistent and non-persistent), optimizing the network payload size with compression, various techniques of caching and parallelism capacities are all essential skills needed.

A scalability problem is when your site supports 5,000 concurrent users, but it needs to support 25,000. Applying the primary skills just listed will not solve your scalability need. Simply adding 5x of servers is a simple way to provide support for more concurrent users, but where is the bottleneck or limitation of your application as you scale. Does adding 5x web servers place too much load on your caching tier or your database tier? While most applications utilize load balancing for web traffic, and so a new webserver is generally straightforward (to a point), can your application even support adding more database servers? Or does your architecture lead to read scalability, but not write scalability? Not being able to scale writes is a clear single point of failure for scalability. Most scalability needs require (re)architecture of your stack and the management of how this can be achieved while maintaining an operational site. After a point when you have 500+ servers, adding 50 more servers is generally the role of great automated deployment processes. The problem is usually greater when moving from 5 servers to 25 servers.

For employers that are writing a job description and using a specific job title, consider if the objectives in the description matches the title.

This leads to the question, what about a Reliability Engineer? That is another detailed discussion that relates to performance and scalability, but also have very different goals. Clearly defining your uptime needs is just one question a reliability engineer needs to ask.