Amazon S3 suffered a significant outage on Tuesday in its US-EAST-1 region. The outage affected a number of companies in seemingly unpredictable ways. Yesterday, a DNS outage at GoDaddy had a similar effect on the availability of an otherwise unrelated set of Internet sites. We saw similar outages last year as a result of configuration problems at Level 3 and DDoS attacks from the Mirai botnet. All of these outages point to significant resilience issues that come with cloud and managed hosting services. These issues should be addressed as part of risk management planning, but as our recent study in Ashburn, VA highlighted, customers and data center and network providers often lack an adequate shared vocabulary for making these informed risk decisions.
Cloud services offer abundant convenience and accessibility. Cloud computing models such as Platform as a Service (PaaS), Software as a Service (SaaS), and Infrastructure as a Service (IaaS) provide businesses with convenient ways to obtain access to large amounts of data from any location. In theory, reliance on the cloud can help mitigate risk by decentralizing data rather than keeping it in a specific location. However, this way of thinking carries its own risk, as the Amazon Web Services (AWS) S3 outage on Tuesday confirmed.
Amazon’s Simple Storage Service (S3), a web-based storage service, experienced widespread issues at its Northern Virginia datacenter on Tuesday. Consequently, providers that rely on Amazon storage saw partial or full unavailability of the websites, apps, and devices that depend on it. AWS S3 hosts a variety of data, including but not limited to images, app backends, and entire websites. [i] While Amazon’s Northern Virginia datacenter was down, S3 remained up in its 13 other regions. [ii] Unfortunately, Amazon’s services that monitor S3 also depended on S3 itself, so the monitoring infrastructure could not adequately report on its status until parts of S3 were restored.
The consequences of this outage are significant, as Amazon S3 is used by 148,213 websites and 121,761 unique domains. [iii] Amazon’s AWS service health dashboard reported that the S3 outage was due to “high error rates with S3 in US-EAST-1.” [iv] Amazon has since published an explanation of what caused the service disruption in the Northern Virginia (US-EAST-1) Region: [v]
The S3 team was debugging a billing issue that was causing their system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. One of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable. […] By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
This event brings to light a critical issue with depending on cloud infrastructure. Businesses may feel that they are mitigating risk and increasing resilience by decentralizing their data in the cloud, but this may or may not be true, as cloud servers are often still co-located in a specific geographic area. Further, those who use cloud-based storage may not be aware that their data is held in a specific geographic location and may not keep regularly updated offsite backups, which can cause major problems when their cloud provider has availability issues. One way for businesses to increase resilience is to store data in multiple regions, or to build a site that does not rely solely on S3 for storage.
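As a rough illustration of not relying on a single storage location, the sketch below tries a list of stores in order and returns the first successful read. It is a minimal sketch under stated assumptions: `fetch_with_fallback`, `primary_region`, and `secondary_region` are hypothetical names, and the two stand-in functions simulate storage clients rather than calling any real AWS API.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical type: a "fetcher" takes an object key and returns its bytes,
# raising an exception when its backing store is unavailable.
Fetcher = Callable[[str], bytes]


def fetch_with_fallback(key: str, stores: Iterable[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each store in order; return (store_name, data) from the first that answers."""
    errors = []
    for name, fetch in stores:
        try:
            return name, fetch(key)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all stores failed for {key!r}: " + "; ".join(errors))


# Illustrative stand-ins for real storage clients (assumptions, not AWS calls).
def primary_region(key: str) -> bytes:
    raise ConnectionError("high error rates")  # simulate the regional outage


def secondary_region(key: str) -> bytes:
    return b"contents of " + key.encode()


source, data = fetch_with_fallback(
    "logo.png",
    [("us-east-1", primary_region), ("us-west-2", secondary_region)],
)
print(source, data)
```

A fallback like this only helps if the second copy actually exists and is kept current, which is the same discipline as maintaining regularly updated offsite backups.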
Cascading dependencies make an informed risk planning regime more difficult. If multiple cloud providers rely on the same dependency and it becomes unavailable, all of those providers become unavailable as well. For more information on the sticky issues of cloud dependencies, see our interactive cloud infographic here.
Dependencies and interdependencies between data center and network providers are not well understood, and this continues to compound the problems of building resilient infrastructure that depends on cloud services.
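One way to make such dependencies explicit is to record them as a graph and ask which services are affected, directly or transitively, when one node fails. The sketch below is only illustrative: the `DEPENDS_ON` map is a simplified, hypothetical inventory loosely based on the services named in Amazon's summary, and `customer-website` and `cdn` are invented entries.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical, simplified dependency map: each service lists what it relies on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "s3-status-dashboard": ["s3"],
    "ec2-new-instance-launch": ["s3"],
    "ebs-snapshot-restore": ["s3"],
    "lambda": ["s3"],
    "customer-website": ["lambda", "cdn"],  # invented example customer
    "cdn": [],
    "s3": [],
}


def affected_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed service.
    seen: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


print(sorted(affected_by("s3", DEPENDS_ON)))
```

In this toy map the customer website never references S3 directly, yet it still appears in the result because it depends on a service that does. That kind of hidden exposure is difficult to spot without an explicit inventory.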
COAR is currently leading studies that will investigate these exact resilience issues. The Department of Homeland Security (DHS) Office of Infrastructure Protection (IP) addresses a range of hazards that could have significant consequences both regionally and nationally via the Regional Resiliency Assessment Program (RRAP). COAR is on FY3 of the Ashburn RRAP and has just started kick-off meetings for the Denver RRAP. You can read more about the results of the Ashburn RRAP here.
This post was written by: Mike Thompson
i. Etherington, Darrell. TechCrunch. Amazon AWS S3 outage is breaking things for a lot of websites and apps. Accessed March 2 2017. Available via: https://techcrunch.com/2017/02/28/amazon-aws-s3-outage-is-breaking-things-for-a-lot-of-websites-and-apps/
ii. Miller, Ron. TechCrunch. The Day Amazon S3 Storage Stood Still. Accessed March 2 2017. Available via: https://techcrunch.com/2017/03/01/the-day-amazon-s3-storage-stood-still/
iii. SimilarTech. Amazon S3. Accessed March 2 2017. Available via: https://www.similartech.com/technologies/amazon-s3
v. Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region. Accessed March 2 2017. Available via: https://aws.amazon.com/message/41926/