Alerting on a metric’s rate of change
Wednesday, October 14, 2015
Part of your daily routine managing your cloud services is making sure bad things don't happen, and when they do, you want to be notified as soon as possible. For that, Google Cloud Monitoring provides you with powerful alerting capabilities that notify you whenever conditions you've defined are triggered. For example, you can configure Cloud Monitoring to notify you when CPU utilization crosses a certain threshold, or when your application's serving latency is too high.
But what if you don’t know exactly which thresholds to use, or what “too high” really means? What if instead, you just want to notify when things change from their norm?
For that, we're introducing the ability to alert on a metric's rate of change. Instead of choosing a threshold for a certain metric (e.g. "notify me when it crosses 20"), you can now specify a percentage change (e.g. "notify me when it changes by more than 30% compared to its previous values"). This new alert condition type is especially useful for "stable" metrics that are expected to remain within a certain range and not change too abruptly for extended periods of time, such as latency, QPS, and error ratios.
Understanding how metric rate of change conditions are evaluated
Suppose you want to be alerted when disk usage increases by over 30% in a time period of 15 minutes or less (details on how to configure this are in the next section). Here's how this condition is evaluated:
Every minute, at time T, we compute a simple 10 minute moving average for the metric in question (in this case, disk usage) at times T and T minus 15 minutes.
We then check the percentage difference between the two averages, and if it exceeds the percentage you specified, the alert condition is triggered.
Note that the 15 minute period is used as an example and is configurable. However, the 10 minute moving average is fixed; it is used to smooth the metric and prevent false positives. Without this smoothing, a quick spike in the metric would trigger the alert condition even if it's very short-lived and resolves itself instantly.
You can configure the increase and decrease of a metric separately, so in the above example a drop of 30% or more would NOT trigger this alert condition.
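To make this concrete, here's a minimal sketch in Python of the evaluation logic described above. It's an illustration only, not Cloud Monitoring's actual implementation: the function names, the minute-indexed sample dictionary, and the default parameters are assumptions chosen to mirror the 30%-in-15-minutes example.

from statistics import mean

def moving_average(samples, end_minute, window=10):
    """Simple moving average over `window` minutes ending at end_minute.
    `samples` maps a minute number to a metric value (e.g. disk usage %)."""
    points = [samples[m]
              for m in range(end_minute - window + 1, end_minute + 1)
              if m in samples]
    return mean(points) if points else None

def rate_of_change_triggered(samples, now, percent=30.0, lookback=15,
                             direction="increase"):
    """Return True if the smoothed metric changed by more than `percent`
    percent between `lookback` minutes ago and now."""
    current = moving_average(samples, now)
    previous = moving_average(samples, now - lookback)
    if current is None or previous is None or previous == 0:
        return False
    change = (current - previous) / previous * 100.0
    if direction == "increase":
        return change > percent
    return -change > percent  # a decrease is configured as a separate condition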
Let's look at the visual example above, using the same condition previously described. As you can see, disk usage starts rising a little after 9:30 (where the gray vertical line is), but the alert condition isn't triggered yet, since usage has only increased by 20%. Around 9:45 it increases some more and triggers the condition (the red horizontal line at the top of the graph).
The condition is auto-resolved at around 10:10, since looking back 15 minutes no longer yields an increase of 30% or greater. In other words, the metric has been stable for 15 minutes. The condition is triggered again a little after 10:30, when disk usage increases from about 40% to over 80%.
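You can reproduce this trigger-then-resolve behavior with the sketch above. Continuing that example (and assuming the hypothetical rate_of_change_triggered helper is still in scope), a synthetic disk usage series that ramps up and then levels off fires the condition during the ramp and stops firing once the look-back window no longer shows a 30% increase:

# Continues the earlier sketch: rate_of_change_triggered is assumed to be defined above.
# Synthetic disk usage (%) sampled once a minute: flat, a steep ramp, then flat again.
samples = {m: 40.0 for m in range(0, 30)}                           # stable at 40%
samples.update({m: 40.0 + 2.0 * (m - 29) for m in range(30, 45)})  # ramp from 42% to 70%
samples.update({m: 70.0 for m in range(45, 90)})                   # stable at 70%

for minute in range(25, 90):
    if rate_of_change_triggered(samples, minute, percent=30.0, lookback=15):
        print(f"minute {minute}: condition triggered")

The condition fires while the smoothed value is more than 30% above the smoothed value 15 minutes earlier, and it stops firing once the metric has sat at its new level long enough for the look-back comparison to settle, which is the auto-resolution behavior described above.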
Configuring metric rate of change alerting conditions
To create a rate of change alerting condition, simply choose “Metric Rate of Change” from the “Add Condition” menu when creating or editing an alerting policy. Like other condition types, Metric Rate of Change can be combined with other conditions to create a multi-condition alerting policy.
When configuring this condition, you can choose the metric, whether to trigger on an increase or a decrease, the percentage of increase or decrease that should trigger the condition, and the time window to examine for change. The screenshot below shows the condition we examined in the previous example.
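The steps above describe the point-and-click flow in the Cloud Monitoring UI. Purely for illustration, here is roughly how a comparable condition could be expressed programmatically with the present-day Cloud Monitoring v3 API's Python client, using the ALIGN_PERCENT_CHANGE aligner (which likewise converts a time series into a percentage change over the alignment period). The metric filter, display names, and values below are assumptions standing in for the disk usage example, not the exact configuration from the screenshot.

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

def create_rate_of_change_policy(project_id: str) -> monitoring_v3.AlertPolicy:
    """Create an alert policy that fires when disk usage rises by more than
    30% over a 15-minute window (illustrative values only)."""
    client = monitoring_v3.AlertPolicyServiceClient()

    condition = monitoring_v3.AlertPolicy.Condition(
        display_name="Disk usage increases by >30% in 15 minutes",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            # Assumed metric; substitute the metric you actually want to watch.
            filter=(
                'metric.type="agent.googleapis.com/disk/percent_used" '
                'AND resource.type="gce_instance"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=30.0,  # percent change, not an absolute value
            duration=duration_pb2.Duration(seconds=0),
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period=duration_pb2.Duration(seconds=900),  # 15 minutes
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_PERCENT_CHANGE,
                )
            ],
        ),
    )

    policy = monitoring_v3.AlertPolicy(
        display_name="Disk usage rate of change",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
        conditions=[condition],
    )
    return client.create_alert_policy(
        name=f"projects/{project_id}", alert_policy=policy
    )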
Note: you should take into account metric changes related to the time of day, to avoid false notifications. For example, if average QPS increases tenfold in the morning, a condition that checks for increased QPS activity might be triggered unnecessarily.
Conclusion
Alerting on a metric's rate of change gives you more flexibility when monitoring dynamic systems, without the need to explicitly define and maintain static thresholds. These conditions adapt to the way your system behaves, and they're part of our ongoing effort to provide you with a monitoring system that requires less maintenance, so you have more time to innovate. Give it a try: look for a stable metric that should remain within a given range, and configure a percentage increase or decrease that would signal a change worth investigating. You can read more about setting up alerts here.
Please send us an email and let us know what you think. Your feedback helps us improve.
- Posted by Amir Hermelin, Product Manager