Roombas, Coinbase, Come Work With Me
February's edition of Blameless Retros covers outages from Amazon and Coinbase
Welcome back
It's been a minute since I've published an edition of Blameless Retros.
A reminder about who I am and what this is all about.
You’re reading Blameless Retros, a newsletter about how engineering teams learn and respond to failures of technology. I’m Marc Chung and I started this newsletter in 2019. This newsletter is a way to share what I’ve learned from reading incident reports. Weird hobby, I get it.
Though 2020 made it difficult to write (amongst other things), I’m still collecting incident reports—postmortems, interesting Twitter threads—and working through them.
If you’re still interested in reading along, welcome back. If you are no longer interested, please scroll down and unsubscribe.
Our first edition in 2021 covers incidents from Amazon and Coinbase.
When the Roombas stopped working
Link: https://aws.amazon.com/message/11201/
In the days leading up to Thanksgiving last year, you might remember having trouble logging into your favorite websites. There's a high probability this was due to an outage at AWS. This impacted several AWS customers including Roku, Autodesk, the NYC Subway, and iRobot, the makers of Roomba, the robot vacuum cleaner.
Summary
On 11/25/2020, two days before Thanksgiving in the U.S., AWS engineers added additional capacity to Kinesis's front-end fleet of servers. This capacity increase caused all of the servers in the fleet to exceed the maximum number of threads allowed by the operating system, which in turn disrupted the front-end servers' ability to route requests to the back-end.
The team rolled back the changes that triggered the event and proceeded to restart Kinesis. This outage had a wide blast radius due to a couple of factors. It occurred in US-East-1 in Northern Virginia, the first and oldest AWS region. The Kinesis outage also made Amazon Cognito partially unavailable, preventing end-users from getting things done.
Why did it take so long for Kinesis to fully recover?
The postmortem shares: "The front-end fleet consists of many thousands of servers and since they could only add servers at the rate of a few hundred per hour, the restart took many hours to complete."
As for the total outage, here’s the timeline I pieced together:
5:15 AM PST - Team receives the first alerts
9:39 AM PST - Team confirms the root cause and shortly after restarts the front-end fleet.
Around noon PST - Error rates start dropping
10:23 PM PST - Team confirms that Kinesis has fully returned to normal
Engineers found the root cause a little over 4 hours after the first alert, error rates began dropping a couple of hours later, and it was another 10 hours until Kinesis was fully operational again.
What exactly do Kinesis' front-end servers do?
Again, the postmortem shares: "The front-end's job is to handle authentication, throttling, and routing to the correct back-end clusters." Front-end servers act like a proxy or a control plane routing requests to the back-end clusters which provide scalability around stream processing.
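To make that concrete, here's a minimal sketch of a front-end request path: authenticate, throttle, then look up the back-end cluster and forward the request. Everything here is made up for illustration; it's not Kinesis code.

    # A toy sketch of the front-end's job as the postmortem describes it:
    # authenticate, throttle, and route to the correct back-end cluster.
    # The data structures and names here are hypothetical.

    ALLOWED_TOKENS = {"secret-token": "account-123"}     # stand-in for auth
    REQUEST_BUDGET = {"account-123": 5}                   # stand-in for throttling
    SHARD_MAP = {"clickstream": "backend-cluster-7"}      # stream -> back-end cluster

    def handle(token, stream_name, payload):
        account = ALLOWED_TOKENS.get(token)
        if account is None:
            return {"status": 403, "error": "AccessDenied"}
        if REQUEST_BUDGET.get(account, 0) <= 0:
            return {"status": 429, "error": "Throttled"}
        REQUEST_BUDGET[account] -= 1
        cluster = SHARD_MAP.get(stream_name)
        if cluster is None:
            return {"status": 404, "error": "UnknownStream"}
        # In real life this forwards the record to the back-end cluster;
        # here we just show which cluster would receive it.
        return {"status": 200, "routed_to": cluster, "bytes": len(payload)}

    print(handle("secret-token", "clickstream", b"hello"))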
What else was impacted?
Amazon Cognito, CloudWatch, EventBridge, and several AWS customers with user-facing products.
Amazon Cognito is an AWS product that offers account management as an API. Due to the outage’s length, Cognito webservers began to block on Kinesis. Recall that requests to Kinesis were stalling because the front-end fleet was unable to route requests to the back-end. As Cognito became unavailable, businesses that relied on Cognito had customers who experienced slow or completely unusable software.
Wanna understand what that experience felt like as an end-user? Let's turn to Twitter for that perspective:
Adobe Spark users were unable to access or edit their projects.

New Roku users (TVs, set-top boxes) were unable to activate their devices or browse their channels.


Flickr users were unable to login or create new accounts.


Autodesk users couldn't verify their licenses and subsequently were not able to use AutoCAD.

Chime users experienced intermittent issues


Ring users couldn't log in to their app and received errors. One Ring user claimed all 70 motion alerts were delivered at once, but without videos.

Shipt customers couldn't place orders from their app.

The NYC Subway was unable to remove an alert - presumably because they couldn’t log in to their admin portal.

Finally, the outage also impacted Roombas.

The BBC confirmed that although Roombas' connected features were down, they could still be used without an internet connection, as long as you didn't need to schedule cleanings, activate them remotely, or keep the robot in a specific room.
Back to AWS. CloudWatch Events and EventBridge also experienced delays due to this outage.
From the postmortem, "CloudWatch Events and EventBridge experienced increased API errors and delays in event processing starting at 5:15 AM PST. As Kinesis availability improved, EventBridge began to deliver new events and slowly process the backlog of older events."
According to this AWS user, "[the] CloudWatch outage meant we couldn’t see what was and wasn’t working cloud-side."


EventBridge is also used by Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS). The outage prevented customers from provisioning new clusters or scaling existing clusters for almost 12 hours.
What can we learn from all this?
The key ideas shared are: limiting the blast radius through a concept they call cellularization, adding observability to thread counts, and using instance types with bigger CPUs.
Limiting the blast radius
This is a new name (to me) for a familiar problem. Cellularization is described as "an approach to isolating the effects of failure within a service, and to keep the components of the service operating within a previously tested and operated range."
Limiting the effects of an outage means that when a service fails (and it will), the unavailable service doesn't cause further widespread outages in the services that depend on it. These cascading effects are hard to reason through. When it happens, it looks a bit like falling dominoes, but without the hindsight of having the pieces set up ahead of time. Services that are built to tolerate and embrace failure are key to limiting the blast radius and keeping dependent services standing. To bring this to life, AWS engineers will be moving parts of their services to dedicated fleets.
I don’t think this is a one-size-fits-all approach, so think of cellularization here as a guiding strategy.
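To make the idea concrete, here's a minimal sketch of cell-based routing (my illustration, not AWS's design): each account is pinned to one cell, so an outage or a risky rollout in one cell only touches the accounts mapped to it.

    import hashlib

    # A rough sketch of cell-based routing: each account is pinned to one cell,
    # so a failure in cell-2 only affects the accounts that hash to cell-2.
    # Cell count and naming are made up for illustration.
    CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]

    def cell_for(account_id: str) -> str:
        digest = hashlib.sha256(account_id.encode()).hexdigest()
        return CELLS[int(digest, 16) % len(CELLS)]

    # Roll out a risky change to one cell first; the blast radius is 1/len(CELLS).
    for account in ["acct-42", "acct-77", "acct-1001"]:
        print(account, "->", cell_for(account))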
Monitoring for thread saturation
AWS engineers are also planning to add monitoring around thread consumption. Raise your hand if your monitoring strategy includes thread count headroom. Me neither. I view this as an example of monitoring for saturation, the 4th Golden Signal described in Site Reliability Engineering. Monitoring thread count headroom will give the team confidence that additional capacity changes won't exceed the thread count thresholds set by the operating system.
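If you wanted to start somewhere, here's a rough sketch of what that could look like on a Linux host: compare the number of threads that currently exist against the kernel's threads-max limit and alert when headroom gets thin. The 20% threshold is an arbitrary choice, and depending on your setup the binding limit might be a per-process one instead.

    # A hedged sketch of thread-headroom monitoring on a Linux host.
    # It compares the total number of scheduling entities (processes + threads)
    # against the kernel's threads-max limit. The 20% threshold is arbitrary.

    def thread_headroom():
        with open("/proc/sys/kernel/threads-max") as f:
            limit = int(f.read())
        with open("/proc/loadavg") as f:
            # The fourth field looks like "3/1424": runnable / total scheduling entities.
            total = int(f.read().split()[3].split("/")[1])
        return limit, total, (limit - total) / limit

    if __name__ == "__main__":
        limit, total, headroom = thread_headroom()
        print(f"threads: {total}/{limit} ({headroom:.0%} headroom)")
        if headroom < 0.20:
            print("ALERT: thread headroom below 20%")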
Using instances with larger CPUs
As a high thread count was a contributing factor to the outage, AWS engineers will be exploring different server instance types to increase thread count headroom and create some space for growth. It's a common approach when a system is constrained by a fixed resource.
By using instance types with more CPU and memory, they expect to reduce the total number of servers, which means fewer threads, which means more thread headroom.
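The postmortem attributes the thread growth to each front-end server maintaining threads for its peers in the fleet, so the math works out roughly like this (numbers invented for illustration):

    # Back-of-envelope, with invented numbers: if each front-end server keeps one
    # thread per peer in the fleet, then per-server thread count scales with fleet size.
    THREADS_PER_PEER = 1
    BASE_THREADS = 500          # hypothetical threads for request handling, etc.
    OS_THREAD_LIMIT = 10_000    # hypothetical per-server limit

    for fleet_size in (4_000, 8_000, 12_000):
        per_server = BASE_THREADS + THREADS_PER_PEER * (fleet_size - 1)
        headroom = OS_THREAD_LIMIT - per_server
        print(f"fleet={fleet_size:>6}  threads/server={per_server:>6}  headroom={headroom:>6}")
    # Halving the fleet (by moving to larger instances) roughly halves the
    # per-server thread count and restores headroom under the same OS limit.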
Bonus: Making time to test incident response plans
Buried in the learnings was this statement: "We didn’t want to increase the operating system [thread] limit without further testing."
It’s a good mindset to have. While identifying candidate solutions, they dismissed making any changes to the current operating system configuration because it wasn't an area they had previously tested. Increasing the thread limit might have been operationally straightforward, but it would have carried risk. Because changing those limits was an untested strategy, doing it live could have further exacerbated the situation.
Once the outage was wrapped up, they immediately tested the increase in thread count limits. Now the team has another technique for responding to similar incidents. Smart move.
In summary: what started as a routine operation to increase server capacity led to a long outage, impacting several other AWS services and many AWS customers, including Roombas. Kinesis was partially or fully unavailable for 17 hours.
Coinbase's postmortem in April and May 2020
Link: https://blog.coinbase.com/incident-post-mortem-april-29-and-may-9-2020-56494decab9f
Coinbase published a postmortem covering two incidents in mid-2020. Yes, that's right, April and May (it has been a while).
Summary of April’s incident
An outage was caused when too many connections were made to one of their primary databases. How did this happen? Computers, of course.
The spike in connections was due to two events occurring at about the same time: a deployment event and a scaling event that was automatically triggered due to elevated traffic loads.
These events individually would not have been problematic, but the simultaneous combination led to connection depletion.
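Here's a back-of-envelope sketch of that collision, with invented numbers: during a rolling deploy, old and new instances both hold connection pools open, and an autoscaling event on top pushes the total past the database's connection limit.

    # Invented numbers, just to show the failure math: each app instance holds a
    # pool of database connections, a rolling deploy temporarily doubles the
    # instances holding pools, and autoscaling adds even more on top.
    POOL_SIZE = 20
    DB_MAX_CONNECTIONS = 4_500

    steady_state = 100                       # instances before the incident
    during_deploy = steady_state * 2         # old + new instances both connected
    after_autoscale = during_deploy + 50     # autoscaler reacts to elevated traffic

    for label, instances in [("steady state", steady_state),
                             ("during deploy", during_deploy),
                             ("deploy + autoscale", after_autoscale)]:
        used = instances * POOL_SIZE
        print(f"{label:>20}: {used:>5} / {DB_MAX_CONNECTIONS} connections "
              f"({'OK' if used <= DB_MAX_CONNECTIONS else 'EXHAUSTED'})")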
A second outage occurred because clients (web and mobile app users) were automatically retrying requests, which led to a thundering herd, where a large number of clients attempt to reconnect at the same time.
Coinbase and the Coinbase Pro APIs were unavailable for a total of 72 minutes.
Summary of May’s incident
This postmortem was a bit harder to follow.
The second incident involved high latency for all outgoing HTTP requests. This was made worse because their load balancer kept killing healthy application instances that were taking too long to respond.
What caused the high latency?
The high latency was caused by DNS timeouts for outgoing HTTP requests. Those DNS requests were being rate limited.
What caused the rate limiting?
From the postmortem “the increase in latency was due to instance-level rate limiting of the DNS queries.” The postmortem reads like they were rolling out a new rate limiting strategy and encountered some challenges with the rollout itself.
Their load balancer was also killing instances that failed their health checks because an HTTP request pool was saturated: too many requests taking too long to respond eventually left no resources available to make new HTTP requests.
Coinbase and Coinbase Pro APIs were partially unavailable for 43 minutes.
What can we learn from all this?
For the first incident, it's clear there’s value in defining a clear backoff strategy when working with a large fleet of devices or clients.
The most common practice for mitigating the effects of a thundering herd is to add a backoff mechanism with jitter in the wait intervals. Connection pools that warm up and cache connections can also be helpful.
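Here's a minimal sketch of that backoff pattern, often called exponential backoff with full jitter: each retry waits a random amount of time up to an exponentially growing cap, so clients stop retrying in lockstep. The base delay, cap, and attempt count are arbitrary choices.

    import random
    import time

    # Exponential backoff with "full jitter": each retry sleeps a random amount
    # up to an exponentially growing cap, so clients don't retry in lockstep.
    def call_with_backoff(do_request, max_attempts=6, base=0.5, cap=30.0):
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                sleep_for = random.uniform(0, min(cap, base * 2 ** attempt))
                time.sleep(sleep_for)

    # Usage sketch: wrap whatever API call the client retries.
    # call_with_backoff(lambda: api.get("/accounts"))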
In the second incident, my view is that the outage was due to a bad rollout of a new rate limiting strategy. That rollout caused DNS requests to back up, eventually consuming a ton of system resources. Their load balancer's decision to kill instances made the root cause harder to pin down.
There’s a lot to consider, but one takeaway: monitor saturation in connection pools that guard access to fixed resources, like DNS.
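As a rough sketch of that takeaway (names and threshold are hypothetical), you can wrap any bounded pool so it reports its own saturation as a gauge, then alert when it sits near 100%:

    import queue

    # A hedged sketch of a bounded pool that reports its own saturation.
    # Anything that guards a fixed resource (DNS resolvers, DB or HTTP
    # connections) can be wrapped the same way. Names are hypothetical.
    class MonitoredPool:
        def __init__(self, make_resource, size):
            self.size = size
            self._free = queue.Queue()
            for _ in range(size):
                self._free.put(make_resource())

        def saturation(self):
            # 0.0 = idle, 1.0 = every resource is checked out.
            return (self.size - self._free.qsize()) / self.size

        def acquire(self, timeout=1.0):
            return self._free.get(timeout=timeout)  # raises queue.Empty when saturated

        def release(self, resource):
            self._free.put(resource)

    pool = MonitoredPool(make_resource=lambda: object(), size=10)
    conn = pool.acquire()
    print(f"pool saturation: {pool.saturation():.0%}")  # emit this as a gauge metric
    pool.release(conn)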
Jobs
Well, if you made it this far, check out some jobs below.
Join the platform team. You'll be helping us roll out our new architectural vision by extracting shared functionality out of product teams. If you have GCP experience, that's a plus.
Senior Software Engineer on the Platform team
Work with me on the student team. We have ambitious goals this year. The Handshake network includes 14 million students and young alumni at more than 1,100 colleges and universities. Our team helps students engage with over 500,000 employers and discover great jobs.
Senior Software Engineer on the Student Team
And finally, a fun tweet.
What's the most memorable outage you've ever been a part of? - Cindy Sridharan.
If you’ve enjoyed today’s newsletter, please tell a friend.
Thank you ❤️
— Marc Chung