There are reports that your website is down. You pull up the login page without incident. What’s next?
Monitoring is critical. How detailed is it? How frequently are you sampling? The resolution of any issue is only as good as the response to a paged alert. Who is looking into the issue? What escalation path exists?
In today’s complex, interconnected infrastructure, is it ever that simple? For an AWS-hosted solution, is it an AWS issue? Does status.aws.amazon.com give you a clue? Does the inability to access other services or sites you are using at this moment indicate a larger problem? Is it AWS-related for a single service, an availability zone, or even an entire region? Having experienced all of those before, sometimes it is obvious and sometimes it is not. Does a Twitter search report shared experiences of regional outages? Was it that severed Verizon underwater cable?
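Those questions can be turned into a first-pass triage script. The sketch below is only illustrative: `demo.example.com` is a placeholder domain, and the diagnosis strings are my assumptions about where to look first in an AWS-hosted stack.

```shell
#!/bin/sh
# First-pass outage triage: is it DNS, connectivity, or the application?
# demo.example.com is a placeholder; substitute your own endpoint.

check_dns() {
  # Prints "ok" if the name resolves, "fail" otherwise.
  if nslookup "$1" >/dev/null 2>&1; then echo ok; else echo fail; fi
}

check_http() {
  # Prints the final HTTP status after redirects; curl emits 000 on failure.
  curl -skL -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

classify() {
  # $1 = dns result (ok/fail), $2 = final HTTP status code.
  if [ "$1" = "fail" ]; then
    echo "DNS failure: check Route 53 / registrar"
  elif [ "$2" = "000" ]; then
    echo "No HTTP response: check the ELB, security groups, network path"
  elif [ "$2" -ge 500 ]; then
    echo "Server error $2: check the infrastructure behind the ELB"
  else
    echo "Responding with $2: dig into redirects and page content"
  fi
}

classify "$(check_dns demo.example.com)" "$(check_http https://demo.example.com)"
```

It will not find the root cause, but it answers the first question quickly: which layer to start digging into.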
I learned two things this week in the triage of this situation. The first is that the old CLI tools you have been using for 20+ years still help you triage quickly. Do not discount them or the detail they provide. I was able to identify and reproduce an underlying cause with just curl. For many reviewing the outage, the problem did not manifest as an error. It turned out there were two distinct paths from two separate domains to the ultimate target page. This was not immediately obvious or known, and there was no definitive network diagram to describe it.
```
$ nslookup demo.internal-example.com

demo.internal-example.com  canonical name = internal.us-east-1.elb.amazonaws.com.
Name:    internal.us-east-1.elb.amazonaws.com
Address: 10.10.1.2
Name:    internal.us-east-1.elb.amazonaws.com
Address: 10.10.0.3
Name:    internal.us-east-1.elb.amazonaws.com
Address: 10.10.2.4
```
```
$ nslookup demo.public-example.com

Non-authoritative answer:
demo.public-example.com  canonical name = external.us-east-1.elb.amazonaws.com.
Name:    external.us-east-1.elb.amazonaws.com
Address: 126.96.36.199
Name:    external.us-east-1.elb.amazonaws.com
Address: 188.8.131.52
```
The first indication was actually that one of the ELBs was not in the AWS account holding all the other resources, and that account was not viewable (why is a separate discussion). curl then helped to traverse the various redirects of each ELB using these options:
- `-i`/`--include` – include the response headers in the output
- `-k`/`--insecure` – allow insecure SSL/TLS connections
- `-L`/`--location` – follow redirects
```
$ curl -ikL external.us-east-1.elb.amazonaws.com
HTTP/1.1 301 Moved Permanently
Server: awselb/2.0
Date: Thu, 11 Feb 2021 20:34:47 GMT
Content-Type: text/html
Content-Length: 134
Location: https://external.us-east-1.elb.amazonaws.com:443/
Proxy-Connection: Keep-Alive
Connection: Keep-Alive
Age: 0

HTTP/1.1 200 Connection established

HTTP/2 302
date: Thu, 11 Feb 2021 20:34:48 GMT
content-length: 0
location: http://demo.unavailable.com
cache-control: no-cache

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 2071
Date: Thu, 11 Feb 2021 19:09:29 GMT
Last-Modified: Tue, 18 Dec 2018 05:32:31 GMT
Accept-Ranges: bytes
Server: AmazonS3
X-Cache: Hit from cloudfront
Via: 1.1 44914fa6421b789193cec8998428f8bd.cloudfront.net (CloudFront)
Proxy-Connection: Keep-Alive
Connection: Keep-Alive
Age: 1071

<html
```
Using these commands was nothing new; however, identifying the single line `location: http://demo.unavailable.com` provided a way to isolate, within the chain of redirects, where to focus.
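One quick way to spot that line in a long transcript is to filter the headers down to just the hops. A minimal sketch, using a few abbreviated header lines from the transcript above as inline sample input (HTTP/2 lowercases its header names, hence the case-insensitive match):

```shell
#!/bin/sh
# Reduce curl -ikL output to the redirect chain:
# status lines plus Location headers.
chain() { grep -Ei '^(HTTP|location)'; }

# Sample input abbreviated from the curl transcript above.
chain <<'EOF'
HTTP/1.1 301 Moved Permanently
Server: awselb/2.0
Location: https://external.us-east-1.elb.amazonaws.com:443/
HTTP/2 302
location: http://demo.unavailable.com
HTTP/1.1 200 OK
EOF
```

In practice you would pipe live output into the same filter, e.g. `curl -sikL <url> | grep -Ei '^(HTTP|location)'`.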
Ultimately the issue was not ELB-related, but internal infrastructure behind this one ELB. When corrected, the result was (trimmed for readability):
```
$ curl -ikL external.us-east-1.elb.amazonaws.com
HTTP/1.1 301 Moved Permanently
Server: awselb/2.0
Date: Thu, 11 Feb 2021 20:37:18 GMT
Content-Type: text/html
Content-Length: 134
Location: https://external.us-east-1.elb.amazonaws.com:443/
Proxy-Connection: Keep-Alive
Connection: Keep-Alive
Age: 0

HTTP/1.1 200 Connection established

HTTP/2 302
date: Thu, 11 Feb 2021 20:37:18 GMT
content-type: text/plain; charset=utf-8
content-length: 27
x-powered-by:
location: /redirect
vary: Accept

HTTP/2 301
date: Thu, 11 Feb 2021 20:37:18 GMT
content-type: text/html
content-length: 162
location: /redirect/

HTTP/2 200
date: Thu, 11 Feb 2021 20:37:18 GMT
content-type: text/html
content-length: 2007
last-modified: Tue, 02 Feb 2021 03:27:13 GMT
vary: Accept-Encoding

<html>
<head>
```
In summary, and as a means to triage a future problem or to monitor, filter the output down to just the status and content-length lines. Before the fix:
```
$ egrep -i "^HTTP|^Content-Length"
HTTP/1.1 301 Moved Permanently
Content-Length: 134
HTTP/1.1 200 Connection established
HTTP/2 302
content-length: 0
HTTP/1.1 200 OK
Content-Length: 2071
```
And after the fix:

```
$ egrep -i "^HTTP|^Content-Length"
HTTP/1.1 301 Moved Permanently
Content-Length: 134
HTTP/1.1 200 Connection established
HTTP/2 302
content-length: 27
HTTP/2 301
content-length: 162
HTTP/2 200
content-length: 2007
```
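That summary suggests a crude synthetic check you could run from cron: record the healthy status-code chain as a baseline and alert when the observed chain drifts. A sketch, where the URL is a placeholder and the baseline is my assumption taken from the fixed transcript above (the second code, 200, is the proxy "Connection established" response):

```shell
#!/bin/sh
# Alert when a URL's redirect chain drifts from a known-good baseline.
# URL is a placeholder; EXPECTED is assumed from the healthy transcript
# above (includes the proxy CONNECT 200 response).
URL="https://demo.public-example.com"
EXPECTED="301 200 302 301 200"

status_chain() {
  # Reads curl -sikL header output on stdin, prints the status codes.
  awk '/^HTTP/ { print $2 }' | tr '\n' ' ' | sed 's/ *$//'
}

ACTUAL=$(curl -sikL --max-time 10 "$URL" 2>/dev/null | status_chain)
if [ "$ACTUAL" != "$EXPECTED" ]; then
  echo "ALERT: redirect chain changed: got '$ACTUAL', expected '$EXPECTED'"
fi
```

It is no substitute for real monitoring, but it captures exactly the signal that mattered in this outage: the shape of the redirect chain, not just the final status code.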
With the proliferation of GUI-based monitoring products, many organizations likely have multiple different monitors available. But are they triggered, and do they enable you to pinpoint the underlying issue? Long gone are the days of a Pingdom-style ping of a URL from multiple locations every minute, with a report of latency or errors before you start digging. This week I learned about DataDog Synthetic Monitoring. DataDog is a well-established monitoring solution that I have only just started to understand; I wish I had a year to delve into mastering it.
In later review, this monitoring showed an already-configured browser test for this top-level URL that was failing; it was simply not alerting correctly. Synthetic monitoring is far more advanced, providing an if-this-then-that style workflow, and it even captures screenshots of the rendered pages.
This experience highlighted the need not only for detailed and redundant monitoring but also for the right process to triage and drill down.
I looked into providing an example of this DataDog feature; however, the free-tier monitoring solution does not provide all the advanced features for the evaluation I’d like. You can look at some product examples instead.
Observability is a key tool in operations management. It should be one of the pillars where continued investment of time, resources, and skills development can add significant value for business continuity.