Roombas, Coinbase, Come Work With Me
February's edition of Blameless Retros covers outages from Amazon and Coinbase
Welcome back
It's been a minute since I've published an edition of Blameless Retros.
A reminder about who I am and what this is all about.
You’re reading Blameless Retros, a newsletter about how engineering teams learn and respond to failures of technology. I’m Marc Chung and I started this newsletter in 2019. This newsletter is a way to share what I’ve learned from reading incident reports. Weird hobby, I get it.
Though 2020 made it difficult to write (amongst other things), I’m still collecting incident reports—postmortems, interesting Twitter threads—and working through them.
If you’re still interested in reading along, welcome back. If you are no longer interested, please scroll down and unsubscribe.
Our first edition in 2021 covers incidents from Amazon and Coinbase.
When the Roombas stopped working
Link: https://aws.amazon.com/message/11201/
In the days leading up to Thanksgiving last year, you might remember having trouble logging into your favorite websites. There's a high probability this was due to an outage at AWS. This impacted several AWS customers including Roku, Autodesk, the NYC Subway, and iRobot, the makers of Roomba, the robot vacuum cleaner.
Summary
On 11/25/2020, two days before Thanksgiving in the U.S., AWS engineers added additional capacity to Kinesis's front-end fleet of servers. This capacity increase caused all of the servers in the fleet to exceed the maximum number of threads allowed by the operating system, which in turn disrupted the front-end servers' ability to route requests to the back-end.
The team rolled back the changes that triggered the event and proceeded to restart Kinesis. This outage had a wide blast radius due to a couple of factors. It occurred in US-East-1 in Northern Virginia, the first and oldest AWS region. The Kinesis outage also made Amazon Cognito partially unavailable, preventing end-users from getting things done.
Why did it take so long for Kinesis to fully recover?
The postmortem shares: "The front-end fleet consists of many thousands of servers and since they could only add servers at the rate of a few hundred per hour, the restart took many hours to complete."
As for the total outage, here’s the timeline I pieced together:
5:15 AM PST - Team receives the first alerts
9:39 AM PST - Team confirms the root cause and shortly after restarts the front-end fleet.
Around noon PST - Error rates start dropping
10:23 PM PST - Team confirms that Kinesis has fully returned to normal
Engineers found the root cause a little over 4 hours after the first alert, error rates began dropping a couple of hours later, and it was another 10 hours until Kinesis was fully operational again.
What exactly do Kinesis' front-end servers do?
Again, the postmortem shares: "The front-end's job is to handle authentication, throttling, and routing to the correct back-end clusters." Front-end servers act like a proxy or a control plane routing requests to the back-end clusters which provide scalability around stream processing.
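To make that concrete, here's a minimal sketch of a front-end request path: authenticate, throttle, then look up the back-end cluster and forward the request. Everything here is made up for illustration; it's not Kinesis code.

    # A toy sketch of the front-end's job as the postmortem describes it:
    # authenticate, throttle, and route to the correct back-end cluster.
    # The data structures and names here are hypothetical.

    ALLOWED_TOKENS = {"secret-token": "account-123"}     # stand-in for auth
    REQUEST_BUDGET = {"account-123": 5}                   # stand-in for throttling
    SHARD_MAP = {"clickstream": "backend-cluster-7"}      # stream -> back-end cluster

    def handle(token, stream_name, payload):
        account = ALLOWED_TOKENS.get(token)
        if account is None:
            return {"status": 403, "error": "AccessDenied"}
        if REQUEST_BUDGET.get(account, 0) <= 0:
            return {"status": 429, "error": "Throttled"}
        REQUEST_BUDGET[account] -= 1
        cluster = SHARD_MAP.get(stream_name)
        if cluster is None:
            return {"status": 404, "error": "UnknownStream"}
        # In real life this forwards the record to the back-end cluster;
        # here we just show which cluster would receive it.
        return {"status": 200, "routed_to": cluster, "bytes": len(payload)}

    print(handle("secret-token", "clickstream", b"hello"))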
What else was impacted?
Amazon Cognito, CloudWatch, EventBridge, and several AWS customers with user-facing products.
Amazon Cognito is an AWS product that offers account management as an API. Due to the outage’s length, Cognito webservers began to block on Kinesis. Recall that requests to Kinesis were stalling because the front-end fleet was unable to route requests to the back-end. As Cognito became unavailable, businesses that relied on Cognito had customers who experienced slow or completely unusable software.
Wanna understand what that experience felt like as an end-user? Let's turn to Twitter for that perspective:
Adobe Spark users were unable to access or edit their projects.

New Roku users (TVs, set-top boxes) were unable to activate their devices or browse their channels.


Flickr users were unable to login or create new accounts.


Autodesk users couldn't verify their licenses and subsequently were not able to use AutoCAD.

Chime users experienced intermittent issues


Ring users couldn't log in to their app and received errors. One Ring user claimed all 70 motion alerts were delivered at once, but without videos.

Shipt customers couldn't place orders from their app.

The NYC Subway was unable to remove an alert - presumably because they couldn’t log in to their admin portal.

Finally, the outage also impacted Roombas.

The BBC confirmed that although Roombas' connected features were down, they could still be used without an internet connection, as long as you didn't need to schedule cleanings, activate them remotely, or keep the robot in a specific room.
Back to AWS. CloudWatch Events and EventBridge also experienced delays due to this outage.
From the postmortem, "CloudWatch Events and EventBridge experienced increased API errors and delays in event processing starting at 5:15 AM PST. As Kinesis availability improved, EventBridge began to deliver new events and slowly process the backlog of older events."
According to this AWS user, "[the] CloudWatch outage meant we couldn’t see what was and wasn’t working cloud-side."


EventBridge is also used by Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS). The outage prevented customers from provisioning new clusters or scaling existing clusters for almost 12 hours.
What can we learn from all this?
The key ideas shared are: limiting the blast radius through a concept they call cellularization, adding observability to thread counts, and using instance types with bigger CPUs.
Limiting the blast radius
This is a new name (to me) for a familiar problem. Cellularization is described as "an approach to isolating the effects of failure within a service, and to keep the components of the service operating within a previously tested and operated range."
Limiting the effects of an outage means that when a service fails (and it will), the unavailable service doesn't cause further widespread outages in the services that depend on it. These cascading effects are hard to reason through. When it happens, it looks a bit like falling dominoes, but without the hindsight of having the pieces set up ahead of time. Services that are built to tolerate and embrace failure are key to limiting the blast radius and keeping dependent services standing. To bring this to life, AWS engineers will be moving parts of their services to dedicated fleets.
I don’t think this is a one-size-fits-all approach, so think of cellularization here as a guiding strategy.
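To make the idea concrete, here's a minimal sketch of cell-based routing (my illustration, not AWS's design): each account is pinned to one cell, so an outage or a risky rollout in one cell only touches the accounts mapped to it.

    import hashlib

    # A rough sketch of cell-based routing: each account is pinned to one cell,
    # so a failure in cell-2 only affects the accounts that hash to cell-2.
    # Cell count and naming are made up for illustration.
    CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]

    def cell_for(account_id: str) -> str:
        digest = hashlib.sha256(account_id.encode()).hexdigest()
        return CELLS[int(digest, 16) % len(CELLS)]

    # Roll out a risky change to one cell first; the blast radius is 1/len(CELLS).
    for account in ["acct-42", "acct-77", "acct-1001"]:
        print(account, "->", cell_for(account))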
Monitoring for thread saturation
AWS engineers are also planning to add monitoring around thread consumption. Raise your hand if your monitoring strategy includes thread count headroom. Me neither. I view this as an example of monitoring for saturation, the 4th Golden Signal described in Site Reliability Engineering. Monitoring thread count headroom will give the team confidence that additional capacity changes won't exceed the thread count thresholds set by the operating system.
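If you wanted to start somewhere, here's a rough sketch of what that could look like on a Linux host: compare the number of threads that currently exist against the kernel's threads-max limit and alert when headroom gets thin. The 20% threshold is an arbitrary choice, and depending on your setup the binding limit might be a per-process one instead.

    # A hedged sketch of thread-headroom monitoring on a Linux host.
    # It compares the total number of scheduling entities (processes + threads)
    # against the kernel's threads-max limit. The 20% threshold is arbitrary.

    def thread_headroom():
        with open("/proc/sys/kernel/threads-max") as f:
            limit = int(f.read())
        with open("/proc/loadavg") as f:
            # The fourth field looks like "3/1424": runnable / total scheduling entities.
            total = int(f.read().split()[3].split("/")[1])
        return limit, total, (limit - total) / limit

    if __name__ == "__main__":
        limit, total, headroom = thread_headroom()
        print(f"threads: {total}/{limit} ({headroom:.0%} headroom)")
        if headroom < 0.20:
            print("ALERT: thread headroom below 20%")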
Using instances with larger CPUs
As a high thread count was a contributing factor to the outage, AWS engineers will be exploring different server instance types to increase thread count headroom and create some space for growth. It's a common approach when a system is constrained by a fixed resource.
By using instance types with more CPU and memory, they expect to reduce the total number of servers, which means fewer threads, which means more thread headroom.
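The postmortem attributes the thread growth to each front-end server maintaining threads for its peers in the fleet, so the math works out roughly like this (numbers invented for illustration):

    # Back-of-envelope, with invented numbers: if each front-end server keeps one
    # thread per peer in the fleet, then per-server thread count scales with fleet size.
    THREADS_PER_PEER = 1
    BASE_THREADS = 500          # hypothetical threads for request handling, etc.
    OS_THREAD_LIMIT = 10_000    # hypothetical per-server limit

    for fleet_size in (4_000, 8_000, 12_000):
        per_server = BASE_THREADS + THREADS_PER_PEER * (fleet_size - 1)
        headroom = OS_THREAD_LIMIT - per_server
        print(f"fleet={fleet_size:>6}  threads/server={per_server:>6}  headroom={headroom:>6}")
    # Halving the fleet (by moving to larger instances) roughly halves the
    # per-server thread count and restores headroom under the same OS limit.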
Bonus: Making time to test incident response plans
Buried in the learnings was this statement: "We didn’t want to increase the operating system [thread] limit without further testing."
It’s a good mindset to have. While identifying candidate solutions, they dismissed making any changes to the current operating system configuration because it wasn't an area they had previously tested. Increasing the thread limit might have been operationally straightforward, but it would have carried risk. Because changing those limits was an untested strategy, doing it live could have further exacerbated the situation.
Once the outage was wrapped up, they immediately tested the increase in thread count limits. Now the team has another technique for responding to similar incidents. Smart move.
In summary: what started as a routine operation to increase server capacity led to a long outage, impacting several other AWS services and many AWS customers, including Roombas. Kinesis was partially or fully unavailable for 17 hours.
Coinbase's postmortem in April and May 2020
Link: https://blog.coinbase.com/incident-post-mortem-april-29-and-may-9-2020-56494decab9f
Coinbase published a postmortem covering two incidents in mid-2020. Yes, that's right, April and May (it has been a while).
Summary of April’s incident
An outage was caused when too many connections were made to one of their primary databases. How did this happen? Computers, of course.
The spike in connections was due to two events occurring at about the same time: a deployment event and a scaling event that was automatically triggered due to elevated traffic loads.
These events individually would not have been problematic, but the simultaneous combination led to connection depletion.
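Here's a back-of-envelope sketch of that collision, with invented numbers: during a rolling deploy, old and new instances both hold connection pools open, and an autoscaling event on top pushes the total past the database's connection limit.

    # Invented numbers, just to show the failure math: each app instance holds a
    # pool of database connections, a rolling deploy temporarily doubles the
    # instances holding pools, and autoscaling adds even more on top.
    POOL_SIZE = 20
    DB_MAX_CONNECTIONS = 4_500

    steady_state = 100                       # instances before the incident
    during_deploy = steady_state * 2         # old + new instances both connected
    after_autoscale = during_deploy + 50     # autoscaler reacts to elevated traffic

    for label, instances in [("steady state", steady_state),
                             ("during deploy", during_deploy),
                             ("deploy + autoscale", after_autoscale)]:
        used = instances * POOL_SIZE
        print(f"{label:>20}: {used:>5} / {DB_MAX_CONNECTIONS} connections "
              f"({'OK' if used <= DB_MAX_CONNECTIONS else 'EXHAUSTED'})")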
A second outage occurred because clients (web and mobile app users) were automatically retrying requests, which led to a thundering herd, where a large number of clients attempt to reconnect at the same time.
Coinbase and the Coinbase Pro APIs were unavailable for a total of 72 minutes.
Summary of May’s incident
This postmortem was a bit harder to follow.
The second incident involved high latency for all outgoing HTTP requests. This was made worse because their load balancer kept killing healthy application instances that were taking too long to respond.
What caused the high latency?
The high latency was caused by DNS timeouts for outgoing HTTP requests. Those DNS requests were being rate limited.
What caused the rate limiting?
From the postmortem “the increase in latency was due to instance-level rate limiting of the DNS queries.” The postmortem reads like they were rolling out a new rate limiting strategy and encountered some challenges with the rollout itself.
Their load balancer was also killing instances that failed their health checks because an HTTP request pool was saturated: too many requests taking too long to respond eventually left no resources available to make new HTTP requests.
Coinbase and Coinbase Pro APIs were partially unavailable for 43 minutes.
What can we learn from all this?
For the first incident, it's clear there’s value in defining a clear backoff strategy when working with a large fleet of devices or clients.
The most common practice for mitigating the effects of a thundering herd is to add a backoff mechanism with jitter in the wait intervals. Connection pools that warm up and cache connections can also be helpful.
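Here's a minimal sketch of that backoff pattern, often called exponential backoff with full jitter: each retry waits a random amount of time up to an exponentially growing cap, so clients stop retrying in lockstep. The base delay, cap, and attempt count are arbitrary choices.

    import random
    import time

    # Exponential backoff with "full jitter": each retry sleeps a random amount
    # up to an exponentially growing cap, so clients don't retry in lockstep.
    def call_with_backoff(do_request, max_attempts=6, base=0.5, cap=30.0):
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                sleep_for = random.uniform(0, min(cap, base * 2 ** attempt))
                time.sleep(sleep_for)

    # Usage sketch: wrap whatever API call the client retries.
    # call_with_backoff(lambda: api.get("/accounts"))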
In the second incident, my view is that the outage was due to a bad rollout of a new rate limiting strategy. That rollout caused DNS requests to back up, eventually consuming a ton of system resources. Their load balancer's decision to kill instances made the root cause harder to pin down.
There’s a lot to consider, but one takeaway: monitor saturation in connection pools that guard access to fixed resources, like DNS.
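As a rough sketch of that takeaway (names and threshold are hypothetical), you can wrap any bounded pool so it reports its own saturation as a gauge, then alert when it sits near 100%:

    import queue

    # A hedged sketch of a bounded pool that reports its own saturation.
    # Anything that guards a fixed resource (DNS resolvers, DB or HTTP
    # connections) can be wrapped the same way. Names are hypothetical.
    class MonitoredPool:
        def __init__(self, make_resource, size):
            self.size = size
            self._free = queue.Queue()
            for _ in range(size):
                self._free.put(make_resource())

        def saturation(self):
            # 0.0 = idle, 1.0 = every resource is checked out.
            return (self.size - self._free.qsize()) / self.size

        def acquire(self, timeout=1.0):
            return self._free.get(timeout=timeout)  # raises queue.Empty when saturated

        def release(self, resource):
            self._free.put(resource)

    pool = MonitoredPool(make_resource=lambda: object(), size=10)
    conn = pool.acquire()
    print(f"pool saturation: {pool.saturation():.0%}")  # emit this as a gauge metric
    pool.release(conn)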
Jobs
Well, if you made it this far, check out some jobs below.
Join the platform team. You'll be helping us roll out our new architectural vision by extracting shared functionality out of product teams. If you have GCP experience, that's a plus.
Senior Software Engineer on the Platform team
Work with me on the student team. We have ambitious goals this year. The Handshake network includes 14 million students and young alumni at more than 1,100 colleges and universities. Our team helps students engage with over 500,000 employers and discover great jobs.
Senior Software Engineer on the Student Team
And finally, a fun tweet.
What's the most memorable outage you've ever been a part of? - Cindy Sridharan.
If you’ve enjoyed today’s newsletter, please tell a friend.
Thank you ❤️
— Marc Chung