Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
Postmortem: Java App Engine outage, July 14, 2011
Wednesday, July 20, 2011
Summary
Last week, we posted about a limited outage on July 14, 2011. Now that our internal postmortem is complete, we thought you would also like to get more detail about what went wrong and what we are going to do to ensure this doesn't happen again.
Root Cause and Analysis
The main lesson learned is that we need to improve our live traffic testing: a relatively minor bug triggered a corner case for some of our customers. The bug was in a new release of the infrastructure in the App Engine Java execution environment. During development, testing, and qualification, this bug was essentially hidden from view because it only manifested itself under specific load patterns. During the outage, requests to affected applications failed with errors when traffic was routed to affected instances. Application logs would have shown that affected instances experienced high latency or elevated error rates, or were not reachable from the Internet. Letting the live traffic testing run longer could have caught this.
In order for live traffic testing to work properly, we need to improve our monitoring as well. In this case, having more points from which to do black box monitoring would have helped immensely. We are currently working on much broader monitoring for App Engine and will be integrating more extensive black box testing in upcoming quarters.
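To make the idea of black box monitoring from several points a little more concrete, here is a minimal sketch in Python. It is not our actual monitoring code. The `ProbeResult` type, the vantage point names, and the 1% threshold are all assumptions made up for illustration. The sketch only shows the aggregation step: given the outcomes of external probes, compute an error rate per probing location and report which locations exceed a threshold.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one external (black box) probe. Hypothetical type."""
    vantage: str       # assumed name of the location the probe ran from
    ok: bool           # True if the request succeeded
    latency_ms: float  # observed end-to-end latency in milliseconds


def error_rates(results: Iterable[ProbeResult]) -> Dict[str, float]:
    """Return the fraction of failed probes for each vantage point."""
    totals: Dict[str, int] = {}
    failures: Dict[str, int] = {}
    for r in results:
        totals[r.vantage] = totals.get(r.vantage, 0) + 1
        if not r.ok:
            failures[r.vantage] = failures.get(r.vantage, 0) + 1
    return {v: failures.get(v, 0) / n for v, n in totals.items()}


def vantages_over_threshold(
    results: Iterable[ProbeResult], threshold: float = 0.01
) -> List[str]:
    """Return, sorted by name, the vantage points whose error rate
    is strictly above the threshold (an assumed 1% by default)."""
    rates = error_rates(results)
    return sorted(v for v, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Made-up sample data: 3 failures out of 100 probes from one location,
    # none from the other.
    sample = [ProbeResult("us-west", i >= 3, 120.0) for i in range(100)]
    sample += [ProbeResult("europe", True, 180.0) for _ in range(100)]
    print(error_rates(sample))
    print(vantages_over_threshold(sample))
```

Tracking each vantage point separately is the point of the sketch: a small impact visible from only one location can disappear when all probes are averaged together.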
Once again, we’d like to point out that we could have done a much better job of communicating issues to all of you. While we strive to strike a balance between letting you know about major issues and not bothering you about day-to-day operations, we clearly should have communicated this incident to you sooner. Rest assured you’ll be better informed of issues in the future.
Timeline
July 14, 2011 - 11:30 AM US/Pacific - The new Java execution environment is released to production.
July 14, 2011 - 5:00-6:00 PM US/Pacific - The previously scheduled Master/Slave read-only maintenance period occurs.
July 14, 2011 - 8:00-9:30 PM US/Pacific - Monitoring shows error rates and latency for Java applications using the Master/Slave datastore are slowly increasing across the entire system. Investigation reveals that the new Java execution environment is malfunctioning.
July 14, 2011 - 9:30 PM US/Pacific - Rollback of the Java execution environment to the previous version begins. Latency and error rates begin to fall.
July 14, 2011 - 11:30 PM US/Pacific - Rollback of the Java execution environment to the previous version completes. Java Master/Slave applications are functioning normally.
Remediation
Faster notification on our status site and downtime-notify mailing list
More live traffic stress tests for new releases
Better black box monitoring to detect small impacts more quickly
[Edit] Clarification: no HR datastore apps were affected. Overall, the outage resulted in a 1.9% error rate, affecting approximately 0.005% of all App Engine traffic at peak.
Posted by Wesley Chun, Google App Engine Team