Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
About today's App Engine outage
Friday, October 26, 2012
This morning we failed to live up to our promise, and Google App Engine applications experienced increased latencies and time-out errors.
We know you rely on App Engine to create applications that are easy to develop and manage without having to worry about downtime. App Engine is not supposed to go down, and our engineers work diligently to ensure that it doesn’t. However, from approximately 7:30 to 11:30 AM US/Pacific, about 50% of requests to App Engine applications failed.
Here’s what happened, from what we know today:
Summary
4:00 am - Load begins increasing on traffic routers in one of the App Engine datacenters.
6:10 am - The load on traffic routers in the affected datacenter passes our paging threshold.
6:30 am - We begin a global restart of the traffic routers to address the load in the affected datacenter.
7:30 am - The global restart plus additional load unexpectedly reduces the count of healthy traffic routers below the minimum required for reliable operation. This causes overload in the remaining traffic routers, spreading to all App Engine datacenters. Applications begin consistently experiencing elevated error rates and latencies.
8:28 am -
google-appengine-downtime-notify@googlegroups.com
is updated with notification that we are aware of the incident and working to repair it.
11:10 am - We determine that App Engine’s traffic routers are trapped in a cascading failure, and that we have no option other than to perform a full restart with gradual traffic ramp-up to return to service.
11:45 am - Traffic ramp-up completes, and App Engine returns to normal operation.
In response to this incident, we have increased our traffic routing capacity and adjusted our configuration to reduce the possibility of another cascading failure. Multiple projects have been in progress to allow us to further scale our traffic routers, reducing the likelihood of cascading failures in the future.
During this incident, no application data was lost and application behavior was restored without any manual intervention by developers. There is no need to make any code or configuration changes to your applications.
We will proactively issue credits to all paid applications for ten percent of their usage for the month of October to cover any SLA violations. This will appear on applications’ November bills. There is no need to take any action to receive this credit.
We apologize for this outage, and in particular for its duration and severity. Since launching the
High Replication Datastore
in January 2011, App Engine has not experienced a widespread system outage. We know that hundreds of thousands of developers rely on App Engine to provide a stable, scalable infrastructure for their applications, and we will continue to improve our systems and processes to live up to this expectation.
- Posted by Peter S. Magnusson, Engineering Director, Google App Engine
Free Trial
GCP Blogs
Big Data & Machine Learning
Kubernetes
GCP Japan Blog
Firebase Blog
Apigee Blog
Popular Posts
Understanding Cloud Pricing
World's largest event dataset now publicly available in BigQuery
A look inside Google’s Data Center Networks
Enter the Andromeda zone - Google Cloud Platform’s latest networking stack
Getting your data on, and off, of Google App Engine
Labels
Announcements
193
Big Data & Machine Learning
134
Compute
271
Containers & Kubernetes
92
CRE
27
Customers
107
Developer Tools & Insights
151
Events
38
Infrastructure
44
Management Tools
87
Networking
43
Open
1
Open Source
135
Partners
102
Pricing
28
Security & Identity
85
Solutions
24
Stackdriver
24
Storage & Databases
164
Weekly Roundups
20
Feed
Subscribe by email
Demonstrate your proficiency to design, build and manage solutions on Google Cloud Platform.
Learn More
Technical questions? Check us out on
Stack Overflow
.
Subscribe to
our monthly newsletter
.
Google
on
Follow @googlecloud
Follow
Follow