Netflix, Google, Twitter Hacks
Debugging A/V pipelines at Netflix, authentication outages at Google, and Twitter's July hack.
Hey there,
As a reminder, I share stories about how engineering teams respond when technology fails. This newsletter was almost called “all technology will fail.”
The case of the extra 40 ms
Link: https://netflixtechblog.com/life-of-a-netflix-partner-engineer-the-case-of-extra-40-ms-b4c2dd278513
Overview
This article was written by a member of the Partner Engineering team at Netflix. This isn't a postmortem, but I really enjoyed the author's approach to describing and debugging a large distributed embedded system.
Around the end of 2017, Netflix couldn't get their app to run smoothly on a new Android Lollipop set-top device.
According to one Netflix vendor (the TV operator), playback on the app was stuttering: video would play for a very short time, then pause, then start again, then pause. Another vendor (the device integrator) could reliably reproduce the stuttering playback through a script. Yet another vendor (the chip manufacturer) determined that the stuttering was because Netflix's Android TV application was not delivering audio quickly enough. All signs pointed to Netflix (the final vendor) having a bug in the audio/video pipeline of its Android application.
Still with me? This is where it gets interesting.
In order to play a 60 fps video, rendering software must render a new frame every 16.66 ms (1/60 s). The Netflix Android app did this by calling an API that provides the next frame of audio or video data, passing it to the Android audio service, then telling the thread scheduler to wait 15 ms before repeating the process -- this is what ultimately delivers your favorite TV shows. Checking for a new frame every 15 ms is fast enough to stay ahead of the video stream, except, in this case, it wasn't. Eventually, the author narrowed it down to a recently fixed bug in the Android thread scheduler.
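If you want to picture that loop in code, here's a minimal sketch of the pacing loop described above. The names (AudioPump, fetchNextAudioChunk, and so on) are stand-ins I made up, not Netflix's actual pipeline code:

```java
// Minimal sketch of the audio pacing loop described above.
// AudioSource and AudioSink are hypothetical stand-ins, not Netflix's code.
final class AudioPump implements Runnable {
    private static final long WAIT_MS = 15; // requested wait between iterations

    private final AudioSource source; // provides the next chunk of decoded audio
    private final AudioSink sink;     // hands data to the Android audio service

    AudioPump(AudioSource source, AudioSink sink) {
        this.source = source;
        this.sink = sink;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            byte[] chunk = source.fetchNextAudioChunk(); // next frame's worth of audio
            sink.write(chunk);                           // pass it along for playback
            try {
                Thread.sleep(WAIT_MS); // ask the scheduler to wake us up in ~15 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    interface AudioSource { byte[] fetchNextAudioChunk(); }
    interface AudioSink { void write(byte[] chunk); }
}
```

The 15 ms wait is what makes the whole thing hinge on how promptly the scheduler wakes the thread back up.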
Let's talk about Android's thread scheduling policy.
In Android 5.0 (Lollipop), threads created while the app is in the background are assigned a 40 ms wait time, and when the app moves to the foreground, that value is not updated. So, because the audio handler thread was created in the background, it had to wait 40 ms before it could load more data, which is what caused the stuttering playback. Had the audio handler thread been created in the foreground, it would have had no trouble keeping pace.
40 ms might not seem like a lot, but for comparison, the Android team fixed this by assigning foreground threads a 50 microsecond (0.05 millisecond) wait time.
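To see why that matters, assume the scheduler is allowed to stretch each wait by up to the full wait-time value: a 15 ms sleep becomes a roughly 55 ms loop under the 40 ms policy, well past the 16.66 ms frame budget, but stays around 15 ms under the 50 microsecond policy. The arithmetic, spelled out:

```java
// Back-of-the-envelope arithmetic for the loop period under each wait-time policy.
public class FramePacingMath {
    public static void main(String[] args) {
        double frameBudgetMs = 1000.0 / 60.0;   // 16.66 ms per frame at 60 fps
        double requestedWaitMs = 15.0;          // what the app asks for

        double backgroundWaitMs = 40.0;         // Lollipop value for background-created threads
        double foregroundWaitMs = 0.05;         // 50 microseconds for foreground threads

        double worstCaseBackground = requestedWaitMs + backgroundWaitMs; // ~55 ms per loop
        double worstCaseForeground = requestedWaitMs + foregroundWaitMs; // ~15.05 ms per loop

        System.out.printf("Frame budget: %.2f ms%n", frameBudgetMs);
        System.out.printf("Background-created thread: up to %.2f ms per loop -> falls behind%n",
                worstCaseBackground);
        System.out.printf("Foreground thread: up to %.2f ms per loop -> keeps up%n",
                worstCaseForeground);
    }
}
```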
Learnings
This was a fun read. My takeaways:
Reproduce the issue, preferably in a repeatable manner. One of the vendors did this with a script.
Logs are always beneficial, but they're not the best way to build a mental model of what's going on. For this investigation, the author visualized the audio throughput and thread handling behavior, focusing on a few numbers: the rate of data transfer, the time when the handler was invoked, and the time when the handler passed control back to Android.
Ask for help. The author found the Netflix engineer who wrote the audio and video pipeline code and got a guided tour of it.
Sprinkle in some luck. The author discovered the issue had been fixed in the next version of Android (6.0).
This article skips over the details of Netflix's fix, but my bet is that they made the playback pipeline thread start in the foreground in order to coerce Android Lollipop into skipping the extra 40 ms of wait time.
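If that guess is right, the workaround could be as small as deferring creation of the playback handler thread until the app is foregrounded. A hypothetical sketch, not Netflix's actual change:

```java
// Hypothetical workaround sketch: only spin up the playback handler thread
// once the app is in the foreground, so Lollipop assigns it the small
// foreground wait time instead of the 40 ms background value.
import android.os.HandlerThread;

final class PlaybackThreadFactory {
    private HandlerThread playbackThread;

    // Called from a lifecycle callback once the app is foregrounded.
    synchronized HandlerThread getOrCreateForegroundThread() {
        if (playbackThread == null) {
            playbackThread = new HandlerThread("nf-playback");
            playbackThread.start(); // created while foregrounded -> foreground wait time
        }
        return playbackThread;
    }
}
```

Again, this is just my reading of the author's hint; the post doesn't say what Netflix actually shipped.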
Google Outage
Link: https://status.cloud.google.com/incident/zall/20013
Overview
On December 14, 2020, all customer-facing Google services that required authentication were unavailable. The authentication service went down because of a misconfiguration of a new quota system, which meant that attempts to log in were rejected and users were locked out of their Google accounts.
Google's User ID Service maintains a unique identifier for every account and also handles authentication credentials and account metadata lookup APIs.
Prior to the outage, Google's User ID authentication service was part of an ongoing migration to a new quota system.
Think of a quota system as guard rails for running software. Data centers host and run software built by hundreds or thousands of customers. Quota systems are used to enforce limits on system resources. Limiting CPU, memory, storage, or network bandwidth usage is one way to ensure that one customer’s software doesn’t accidentally steal resources from neighboring customers.
The new quota system incorrectly reported the usage for the User ID Service as 0 (as in, unused). This instructed the quota system to decrease the quota allowed for the User ID Service. That was an unforeseen side effect: if no one is using the User ID Service, why not free up its resources for other services?
Except the User ID Service, the one that handles authentication for customers globally, was in use.
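To make the failure mode concrete, here's an illustrative sketch -- not Google's actual quota system -- of automation that trusts whatever usage number it's handed. Feed it a report of zero and it happily shrinks a critical service's capacity:

```java
// Illustrative only -- not Google's quota system. Shows how automation that
// trusts a bogus "0 usage" report can shrink a critical service's quota.
import java.util.HashMap;
import java.util.Map;

public class QuotaManager {
    private final Map<String, Long> quotas = new HashMap<>();

    public QuotaManager() {
        quotas.put("user-id-service", 1_000_000L); // current allowed capacity
    }

    // Periodically invoked with the latest reported usage per service.
    public void rebalance(String service, long reportedUsage) {
        long current = quotas.get(service);
        // Naive policy: shrink the quota toward recent usage
        // to "free up resources for other services".
        long target = Math.max(reportedUsage * 2, 0);
        if (target < current) {
            quotas.put(service, target);
            System.out.printf("Shrinking %s quota: %d -> %d%n", service, current, target);
        }
    }

    public static void main(String[] args) {
        QuotaManager manager = new QuotaManager();
        // Migration bug: usage for the User ID Service is reported as 0.
        manager.rebalance("user-id-service", 0);
        // Quota is now 0 -- legitimate authentication traffic gets rejected.
    }
}
```

The real system is far more sophisticated, but the first action item in Google's postmortem (slowing down the quota management automation) maps onto exactly this failure mode.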
Learnings
Let's go straight to the source for learnings.
In addition to fixing the underlying cause, Google will be implementing changes to prevent, reduce the impact of, and better communicate about this type of failure in several ways:
Review the quota management automation to prevent fast implementation of global changes.
Improve monitoring and alerting to catch incorrect configurations sooner.
Improve reliability of tools and procedures for posting external communications during outages that affect internal tools.
Evaluate and implement improved write failure resilience into the User ID service database.
Improve resilience of GCP Services to more strictly limit the impact to the data plane during User ID Service failure.
Coverage of Twitter's July 2020 hack
This one is for Twitter users. Remember how every verified account was locked last July?
Here’s what went down.
Because it didn’t know where the attack was coming from, Twitter couldn’t predict what celebrity might fall next. Turning the service off altogether wasn’t practical; according to one former executive, it’s not even clear that Twitter could easily do that if it wanted to. But by 6:18 pm ET the team opted for the next-harshest thing: Block all verified accounts from tweeting. They placed further restrictions on any accounts that had changed their password in the previous weeks.
An update on our security incident
The social engineering that occurred on July 15, 2020, targeted a small number of employees through a phone spear phishing attack. A successful attack required the attackers to obtain access to both our internal network as well as specific employee credentials that granted them access to our internal support tools. Not all of the employees that were initially targeted had permissions to use account management tools, but the attackers used their credentials to access our internal systems and gain information about our processes. This knowledge then enabled them to target additional employees who did have access to our account support tools. Using the credentials of employees with access to these tools, the attackers targeted 130 Twitter accounts, ultimately Tweeting from 45, accessing the DM inbox of 36, and downloading the Twitter Data of 7.
New York State's Department of Financial Services
The Twitter Hack is a cautionary tale about the extraordinary damage that can be caused even by unsophisticated cybercriminals. The Hackers’ success was due in large part to weaknesses in Twitter’s internal cybersecurity protocols.
The problems started at the top: Twitter had not had a chief information security officer (“CISO”) since December 2019, seven months before the Twitter Hack. A lack of strong leadership and senior-level engagement is a common source of cybersecurity weaknesses. Strong leadership is especially needed in 2020, when the COVID-19 pandemic has created a host of new challenges for IT and cybersecurity. Like many organizations, in March Twitter transitioned to remote working due to the pandemic. This transition made Twitter more vulnerable to a cyberattack and compounded existing weaknesses.
And finally, Integration Test Email #1
How could I not mention HBO’s Integration Test Email #1.
On June 17, I received this email from HBO.
Turns out, so did a lot of other people. Here’s what HBOMax shared:
We mistakenly sent out an empty test email to a portion of our HBO Max mailing list this evening. We apologize for the inconvenience, and as the jokes pile in, yes, it was the intern. No, really. And we're helping them through it. ❤️
Glad to hear it. We’ve all been there before.
If you’ve enjoyed today’s newsletter, please tell a friend.
Thank you ❤️
Marc Chung