A reliable and dependable application requires observability

Observability (o11y) is a critical pre-requisite component in software architecture when advocating for and preparing organizations for making informed decisions on the success of their application. Open Telemetry from the Cloud Native Computing Foundation is the goto standard regardless of your choices of monitoring tools. However, observability is just a building block that I need to explain when advocating for having reliable and dependable systems. Observability will not inform you “Was the customer actually impacted, or how many, or how long?”. Observability will not tell you “the root cause of a problem?”.

My five layers of building blocks for Reliability are:

  1. Observability – The collection of telemetry (metrics, traces and logs) should just be there. If you are using Kubernetes (k8s) and a Java/Python/Node.js application it is already built in. Just do it.
  2. Reproducibility – The ability with a known set of steps and a given configuration and setup you can reproduce an outcome showing the same observed results is a necessary pre-requisite for any feature development or bug fixes.
  3. Testability – After being able to consistently reproduce an observed event with measurable results, the running of various experiments using a variety of changes enables you to adequately test future improvements or corrections to the initial situation, whether it’s a known bug, or a new piece of functionality. Reproducible and consistent testing is an essential component to the release of software for a reliable application.
  4. Scalability – It is impossible to adequately test a system to failure without an observable, reproducible, and testable framework. Many organizations suffer from the management “Can we support X operations” syndrome, when instead the application should know what “X” is automatically, and have adequate safeguards in place to prevent its occurrence. The ability to proactively disable [expensive] features for the good of the entire system is not a common practice for software (aka a dark mode). In fact, many organizations do not even have the capability of customer-level and individual component-level feature flags or related rate limits that can manually be implemented.
  5. Dependability – A reliable, highly available, and dependable application requires all of the prior layers to be in place to give a level of assurance to your customers and your company that your product is dependable.

Writing re-runable shell script

I recently started playing with devstack again (An all-in-on OpenStack developer setup). Last time was over 3 years ago because I remember a pull request for a missing dependency at the time.

The installation docs provide information to bootstrap your system with a necessary user and privileges, however like many docs for software setup they contain one off instructions.

adduser stack
echo "stack ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

When you write operations code you need to always be thinking about “testability” and “automation”. It is important to write re-runable code. You should always write parameterized code when possible, which can be refactored into usable functions at any time.

This is a good example to demonstrate a simple test condition for making the initial instructions re-runable.

sudo su -
NEW_USER="stack"
# This creates default group of same username
# This creates user with default HOME in /home/stack
[ `grep ${NEW_USER} /etc/passwd | wc -l` -eq 0 ] && useradd -s /bin/bash -m ${NEW_USER}
NEW_USER_SUDO_FILE="/etc/sudoers.d/${NEW_USER}"
[ ! -s ${NEW_USER_SUDO_FILE} ] && umask 226 && echo "${NEW_USER} ALL=(ALL) NOPASSWD: ALL" > ${NEW_USER_SUDO_FILE}
ls -l ${NEW_USER_SUDO_FILE}

Testability

If I was to provide one tip for organizations on how to implement a successful technology solution, I would state you need to ensure your product/software/system is completely testable. Independent on how you elect to test your system, the design of creating a completely testable infrastructure will enable exponential savings as your business grows.

You achieve this by implementing an Application Programming Interface (API) for all data access. Your goal should be to move away from technology dependence and towards a technology agnostic solution, your dependency is now your business specification. This does not mean you are going to expose this API to the Internet, your own applications are your first clients, your web site and your management reporting tools. Your website is just a client presentation of your most valuable asset, your information.

Creating an environment that enables you test and verify your information independently from how is renders in a browser, enables a complete level of possible automation for testing this component of your communication channel. While end to end testing is also necessary, this becomes more complex and is impractical if this is your only means of testing. The principle of any popular Agile methodology approach is around testing where one popular term is Test Driven Development (TDD). While you may not implement TDD, knowing and applying the principals enables testability.

As you continue to grow, you will realize you now have the infrastructure and ability to stress test your most important system features. It is a common misconception that testing is about ensuring your software works as designed. Testing should not be about what works, but what doesn’t break. The goal of testing should be to break your software. The ability to stress test your system is to know when your system will fail. This ability to predict can benefit you ahead of time. You do not want your startup to suffer a successful catastrophe where you meet all your marketing goals, but you system crashes, and while the “Twitter failed whale” is frustrating, this is one approach attempt to mediate a total failure.