A reliable and dependable application requires observability

Observability (o11y) is a critical pre-requisite component in software architecture when advocating for and preparing organizations for making informed decisions on the success of their application. Open Telemetry from the Cloud Native Computing Foundation is the goto standard regardless of your choices of monitoring tools. However, observability is just a building block that I need to explain when advocating for having reliable and dependable systems. Observability will not inform you “Was the customer actually impacted, or how many, or how long?”. Observability will not tell you “the root cause of a problem?”.

My five layers of building blocks for Reliability are:

  1. Observability – The collection of telemetry (metrics, traces and logs) should just be there. If you are using Kubernetes (k8s) and a Java/Python/Node.js application it is already built in. Just do it.
  2. Reproducibility – The ability with a known set of steps and a given configuration and setup you can reproduce an outcome showing the same observed results is a necessary pre-requisite for any feature development or bug fixes.
  3. Testability – After being able to consistently reproduce an observed event with measurable results, the running of various experiments using a variety of changes enables you to adequately test future improvements or corrections to the initial situation, whether it’s a known bug, or a new piece of functionality. Reproducible and consistent testing is an essential component to the release of software for a reliable application.
  4. Scalability – It is impossible to adequately test a system to failure without an observable, reproducible, and testable framework. Many organizations suffer from the management “Can we support X operations” syndrome, when instead the application should know what “X” is automatically, and have adequate safeguards in place to prevent its occurrence. The ability to proactively disable [expensive] features for the good of the entire system is not a common practice for software (aka a dark mode). In fact, many organizations do not even have the capability of customer-level and individual component-level feature flags or related rate limits that can manually be implemented.
  5. Dependability – A reliable, highly available, and dependable application requires all of the prior layers to be in place to give a level of assurance to your customers and your company that your product is dependable.

AWS RDS Aurora wish list

I’ve had this list on a post-it note on my monitor for all of 2022. I figured it was time to write it down, and reuse the space.

In summary, AWS suffers from the same problem that almost every other product does. It sacrifices improved security for backward compatibility of functionality. IMO this is not in the best practices of a data ecosystem that is under constant attack.

  • Storage should be encrypted by default. When you launch an RDS cluster its storage is not encrypted. This goes against their own AWS Well-Architected Framework Section 2 – Security.
  • Plain text passwords. To launch a cluster you must specify a password in plain text on the command line, again not security best practice. At least change this to using a known secret from AWS secrets manager.
  • TLS for administrative accounts should be the only option. The root user should only be REQUIRE SSL (MySQL syntax).
  • Expanding on the AWS secrets manager usage for passwords, there should not need to be lambda code and cloudwatch cron event for rotation, it should just be automatically built in.
  • The awscli has this neat wait command that will block until you can execute the next statement in a series of sequential events to prepare and launch a cluster, but it doesn’t work for create-db-cluster. You have to build in your own manual “wait” until “available” process.
  • In my last position, I was unable to enforce TLS communications to the database from the application. This insecure practice is a more touchy situation, however, there needs to be some way to ensure security best practices over application developer laziness in the future.
  • AWS has internal special flags that only AWS support can set when say you have a bug in a version. Call it a per-client feature flag. However, there is no visibility into what is set, which account, which cluster, etc. Transparency is of value so that the customer knows to get that special flag unset after minor upgrades.
  • When you launch a new RDS Cluster, for example, MySQL 2.x, you get the oldest version, back earlier in the year it was like 2.7.2, even when 2.10.1 was released. AWS should be using a default version when only an engine is specified as a more current version. I would advocate the latest version is not the automatic choice, but it’s better to be more current.
  • the ALTER SYSTEM CRASH functionality is great, but it’s incomplete. You cannot for example crash a global cluster, forcing a region-specific failover. If you have a disaster resiliency plan that is multi-region it’s impossible to actually test it. You can emulate a controlled failover, but this is a different use case to a real failover (aka Dec 2021)
  • Use arn when it’s required not id. This goes back to my earlier point over maximum compatibility over usability, but when a --db-instance-identifier, or --db-instance-identifier requires the value to be the ARN, then the parameter should be specific. IMO –identifier is what you use for that argument, e.g. --db-cluster-identifier. When you specify for example --replication-source-identifier this must be (as per docs) “The Amazon Resource Name (ARN) of the source DB instance or DB cluster if this DB cluster is created as a read replica.” It should then be --replication-source-arn. There are a number of different occurrences of this situation.

Our Data Security Moonshot Starts With Prevention

The recent re-announcement of the Cancer Moonshot highlighted a common enemy to many endeavors to improve our society as a whole, and that is using common sense and already known methods.

At a high level The goal of the Cancer Moonshot Scholars program is to inspire and support the next generation of world-class and diverse researchers focused on scientific breakthroughs that will make a difference for patients and drive progress toward the goal of ending cancer as we know it today. source fact sheet

As stories of this announcement filtered thru news outlets with interviews of medical professionals, a known thread appeared. Both lacking in the message, and the single greatest advancement to the problem, which is already known, is prevention. This includes known prevention measures, early detection measures, and education.

As a Data Strategist, Data Security is a critical component of any business and the single best defense is prevention and using common sense.

Here are just some simple basics that seem to have to be discussed and argued repeatedly company after company, product after product, yet there is no single effort to eliminate these poor practices.

  • No clear text passwords. If you have to enter a password on the command line (cough cough AWS CLI) or put a clear text password in a configuration file (cough cough 100s of products), you enable simple techniques to obtain unvetted access to your data.
  • Using clear text passwords is amplified when products offer a more secure means of access and identity management but they do not employ it everywhere.Check out Password Plaintext Storage
  • Clear text transport. It pains me to say it but even in recent employment that held critical PII data, I could not enforce TLS communication between applications and databases. While it was as simple as a configuration option, the constant excuses by engineering management were it was too hard to implement (cough cough BS).
  • The default configuration settings for a product need to be secure, not the default that is most compatible with prior versions. For example, if you launch a new cloud instance database with defaults, is it the most secure options, or the least secure options>
  • Credential rotation. Long-lived credentials should just be eliminated. Often these are also not named users, but commonly used processes.
  • Communicating passwords in clear text. This should never ever happen, yet it does. Have you ever received an (insecure protocol) email saying here is your username and password? A short known list of 5880+ sites can be found in the https://plaintextoffenders.com/ list on github offenders.csv.
  • Data systems accessible via the public internet For example MongoDB article, MySQL/MariaDB article, Redis & ElasticSearch etc, etc
  • Data systems that have no credentials required
  • Data systems that have default credentials that were never changed
  • Storing passwords in clear text
  • Storing passwords with a single salt
  • Storing passwords with a symmetric encryption approach
  • Administrators that use a common account for “root” privileges, not individual named accounts
  • Not patching products with fixed vulnerabilities CVS Program Mission
  • and the list could go on and on….

In all of the above points, there are numerous examples of these data security anti-patterns. While many are due to the products in use, some of these examples represent poor business practices. It should not have to be explained that most attacks and breaches are internal. The common and very incorrect attitude of, we are within our Virtual Private Cloud (VPC) we do not need to encrypt our data is well, plainly wrong.

Transparency

One of the greatest threats to businesses is ransomware. Attackers gain access to system via various means, those above are just the simplest means and then hold businesses ransom. Ransomware has multiple impacts including the loss of a business operating, the process and time of making a decision, the penalty for payment to release the random, and generally the threat of release of their data if a fine is not paid.

There is a lot to unpack even with this ransomware statement. Can you not restore your entire business operations within a suitable RTO and RPO? Is important data not encrypted. Are passwords in your business able to de-encrypted (this should never even be possible). Do you have a disaster recovery (DR) strategy? Can you access critical data via others means and systems independently?

The stigma of a ransomware attack is organizations do not share this openly. They do not share why it happened, what could have been done to prevent this, and sharing all information with federal authorities that should be tracking all occurrences. This information is an important and critical education feedback loop for the whole industry and IMO lacking of attention. Do you know of a website that shared known ransomware attack vectors.

Conclusion

If security is an important aspect to the data in your organization, can you name the people in your security department? Can any individual point out an insecure product with a known fixed vulnerability? Is that information transparent? Is there a process to address that as a top priority, moving engineering and operations goals accordingly? While organizations may employ an error budget for outages, do they employ a security vulnerability budget? Do companies note version updates of all their software, have people read ALL the release notes of each point release, or even know every version of each software product in use in the organization?

For more information, check out

There will always be better and more determined attempts to attack data systems, we have to stop the most obvious first, and we have to participate in identification and remediation endeavors.

Using a simple relatable example to every person, your home. We should start with not leaving the door open, or leaving the keys in the door or simply removing the door all together.

Spoiler – Owning your data isn’t good enough

While this is a catchy title, if you use Software as a Service (SaaS), or an online cloud provider, do you actually own and have total control of your business data and its infrastructure? For all the free and paid services your business uses, what happens if one day, a portion of that were no longer available?

When you have data in a CRM, an analytics platform, a marketing platform, a payments platform, if one of those service providers locks you out of your data, you have lost control and access to a part of your business. Can you still operate unaffected? What is the actual impact? What is your contingency? You could be lucky and the impact is temporary, such as a day or a week, but it could also be longer or even indefinite.

Let me give you a simple but concrete example. Fellow woodworker Eric of Spencley Design posted recently on YouTube “I just lost half of my business”. If you listen to just 2 1/2 minutes from 12:00 to 14:30 of his youtube video explanation you will understand that this business relies on several online SaaS services. Many are free, but for an unexplained reason, whether bad code, bad ML/AI, or several other plausible reasons, one of his income streams was shut down without notice. This was not by his doing, or any of his actions but for unrelated reasons. Online attempts to appeal this situation caused a permanent suspension. Talking to a human to understand what happened, why it happened, and how this can be resolved, was also unanswered because there is no ability to physically speak to a human.

This problem is not limited to online services. A great example of just a decade ago is your business credit card stops working, transactions are declined. If you were lucky you could physically call your bank manager, or go to your bank manager to get to the bottom of the situation. You knew your bank account contained sufficient funds as you maintained on-premise accounting practices and you could provide evidence of such facts. If you run a small business today, do you think you can talk to a human that would have the ability to correct this problem, or would you have to talk to 5 humans, multiple automated (and annoying) systems, costing countless hours of time and frustration?

If you rely on Acme George Inc workspaces product for your small business email and shared documents, what if that becomes blocked? How do you communicate with your customers? What if you use Acme Archie Inc for your customer support ticketing system, and for a week it is unavailable to use? Not only can your customers not report issues, but you have no access to see what issues were already outstanding and work on them independently.

At times there are widespread outages of online presences that have a wide effect across industries from hours to weeks. Cloudflare Jun 21, 2022, Fastly June 8, 2021, Amazon Web Services Dec 7, 2021, and then Dec 15 and Dec 22. A blog post called it the AWS’s December Outagepalooza. The Atlassian April 2022 outage for paying customers lasted upto 2 weeks. Even a free social media company and its related entities incurred widespread impact Facebook Oct 4 ,2021 that affected many gig economy businesses. These outages can have far ranging effects. Actual examples include you cannot pay your employees, your staff at a hospital cannot authenticate to access patient records, transportation and logistics of your shipping business is halted.

I am referring here to loss of access to your data in a SaaS environment, and loss of cloud infrastructure that supported your SaaS services or even your internally developed and maintained systems running on cloud infrastructure. If you are not convinced of the larger ramifications of an extreme loss of infrastructure services what was the impact to Parler in 2021.

My point here is you cannot simply stop using these services, or your cloud provider(s) infrastructure. You need to be prepared. In a traditional system, you backup your data for some degree of disaster, and you support the capability to recover both infrastructure and data from this, and if you a smart you actually test this. Sidebar a colleague recently shared that even with massive investment in infrastructure and global redundancy, a scheduled test for this large bank took down services for 12 hours.

Large SaaS organizations could offer services that offer multi-region or multi-cloud capabilities, but they are also at the mercy of the SaaS providers they use. Do you know all the interdependencies? Look no further than the wipe out of Okta’s stock (down 30%) in one day. CEO of Okta Todd McKinnon cited several factors including a security impact by text message provider Twilio. Read more about that at Twilio Employeee, Customer Accounts Breached Through Texts. And yes, the headline here has an incorrect spelling. I tried to add a comment to offer feedback, but the MarketWatch paywall of 4 articles would not let me create an account to login to leave a comment!

The solution is not to host all of your own infrastructure either. Facebook’s very long outage was self-inflicted and they controlled all of their own infrastructure. It not only had an impact on their websites, their internal staff were unable to use security badges to access critical infrastructure to correct the problem because they were physically locked out of buildings holding the infrastructure.

Returning to the small business owner who uses a marketing platform, an analytics platform, a CRM, a payment platform or even a social media platform. Do you keep current copies of your data in these systems so that if there were a loss, you knew who to communicate with? In the first cited case, did Eric have a list of all of his subscribers, a copy of all his online content, and all comments made by subscribers. Was there a means to communicate with them via other means, or was access to sufficient PII not even possible for what was his original content?

In future posts I will share some of my techniques for ensuring you have a data acquisition strategy.