Performance optimization starts with identifying key metrics, usually related to latency and throughput. Adding monitoring to capture and track these metrics exposes weak points in the application. With the metrics in hand, you can then optimize to improve performance.
Additionally, many monitoring tools let you set up alerts on your metrics, so that you are notified when a certain threshold is met. For example, you might set up an alert to notify you when the percentage of failed requests rises more than x% above normal levels. Monitoring tools help you establish what normal performance looks like and identify abnormal spikes in latency, error counts, and other key metrics. The ability to monitor these metrics is especially important during business-critical timeframes, or after new code has been pushed to production.
Identify latency metrics
Ensure that you keep your UI as responsive as you can, noting that users expect even higher standards from mobile applications. Latency should also be measured and tracked for backend services, particularly since it can lead to throughput issues if left unchecked.
Some suggested metrics to track include:
- Request duration
- Request duration at subsystem granularity (such as API calls)
- Job duration
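For example, a server-side handler might time both the overall request and the Google Ads API call it makes. The sketch below is illustrative: fetch_campaigns and record_metric are hypothetical names, the query is a placeholder, and you would forward the recorded values to whatever monitoring system you use.

```python
import logging
import time


def record_metric(name, value):
    # Placeholder: in a real application, send this to your monitoring system.
    logging.info("metric %s=%0.1f", name, value)


def fetch_campaigns(client, customer_id):
    """Times the overall request and the Google Ads API call separately."""
    request_start = time.monotonic()

    # Time the API call on its own to get subsystem-level granularity.
    api_start = time.monotonic()
    stream = client.get_service("GoogleAdsService").search_stream(
        customer_id=customer_id,
        query="SELECT campaign.id FROM campaign",
    )
    rows = [row for batch in stream for row in batch.results]
    api_duration_ms = (time.monotonic() - api_start) * 1000

    total_duration_ms = (time.monotonic() - request_start) * 1000
    record_metric("request_duration_ms", total_duration_ms)
    record_metric("google_ads_api_duration_ms", api_duration_ms)
    return rows
```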
Identify throughput metrics
Throughput is a measure of the total number of requests served over a given period of time. Throughput may be affected by latency of subsystems, so you might need to optimize for latency to improve throughput.
Some suggested metrics to track include:
- Queries per second
- Size of data transferred per second
- Number of I/O operations per second
- Resource utilization (CPU/memory usage, etc.)
- Size of processing backlog (pub/sub, number of threads, etc.)
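As a minimal sketch of measuring queries per second, you could count completed requests and divide by the elapsed time in a reporting window. The ThroughputTracker class below is an illustrative name, not part of any library.

```python
import threading
import time


class ThroughputTracker:
    """Counts completed requests and reports queries per second over a window."""

    def __init__(self):
        self._count = 0
        self._window_start = time.monotonic()
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:
            self._count += 1

    def snapshot_qps(self):
        """Returns requests per second since the window started, then resets it."""
        with self._lock:
            elapsed = time.monotonic() - self._window_start
            qps = self._count / elapsed if elapsed > 0 else 0.0
            self._count = 0
            self._window_start = time.monotonic()
            return qps
```

You would call record_request() as each request completes and snapshot_qps() from a periodic reporter that publishes the value to your monitoring system.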
Don't be mean
A common mistake in measuring performance is looking only at the mean (average) case. While the mean is useful, it doesn't provide insight into the distribution of latency. A better approach is to track performance percentiles, for example the 50th, 75th, 90th, and 99th percentiles of a metric.
Generally, optimization can be done in two steps. First, optimize for the 90th percentile latency. Then, consider the 99th percentile (also known as tail latency): the small portion of requests that take much longer to complete.
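To make the idea concrete, here is a small sketch that computes nearest-rank percentiles over a batch of latency samples; the sample values are made up for illustration, and in practice your monitoring tool usually computes percentiles for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


latencies_ms = [12, 13, 14, 14, 15, 16, 17, 18, 230, 950]  # example samples
summary = {p: percentile(latencies_ms, p) for p in (50, 75, 90, 99)}
print(summary)  # {50: 15, 75: 18, 90: 230, 99: 950}
```

Notice how the mean of these samples would hide the two slow requests, while the 90th and 99th percentiles surface them immediately.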
Server-side monitoring for detailed results
Server-side profiling is generally preferred for tracking metrics. The server side is usually much easier to instrument, allows access to more granular data, and is less subject to perturbation from connectivity issues.
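One common way to instrument the server side consistently is to wrap handlers so that every request records its latency and outcome. The decorator below is a rough sketch; instrumented and list_campaigns are hypothetical names, and the log call stands in for your metrics pipeline.

```python
import functools
import logging
import time


def instrumented(handler):
    """Records latency and outcome for a server-side handler."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        except Exception:
            logging.exception("handler %s failed", handler.__name__)
            raise
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # In a real service, feed this into your metrics pipeline.
            logging.info("handler=%s latency_ms=%0.1f", handler.__name__, elapsed_ms)
    return wrapper


@instrumented
def list_campaigns(customer_id):
    ...  # application logic
```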
Browser monitoring for end-to-end visibility
Browser profiling can provide additional insights into the end user experience. It can show which pages have slow requests, and you can then map these onto server-side monitoring to further investigate.
Google Analytics provides out-of-the-box monitoring for page load times in the page timings report. This provides several useful views for understanding the user experience on your site, in particular:
- Page load times
- Redirect load times
- Server response times
Monitoring in the cloud
There are many tools you can use to capture and monitor performance metrics for your application. For example, you can use Google Cloud Logging to log performance metrics to your Google Cloud Project, then set up dashboards in Google Cloud Monitoring to monitor and segment the logged metrics.
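For instance, a minimal sketch of sending a structured performance record to Cloud Logging with the google-cloud-logging Python library might look like the following. The logger name and payload fields are arbitrary choices for illustration.

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()  # uses Application Default Credentials
logger = client.logger("ads_api_performance")  # logger name is an arbitrary choice

# Structured payloads make it straightforward to build log-based metrics later.
logger.log_struct(
    {
        "method": "GoogleAdsService.SearchStream",
        "request_duration_ms": 1043,
        "is_fault": False,
    },
    severity="INFO",
)
```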
In the Logging guide, we share an example of logging to Google Cloud Logging from a custom interceptor in the Python client library. With that data available in Google Cloud, we can build metrics on top of the logged data to gain visibility into our application through Google Cloud Monitoring. Follow the guide for user-defined log-based metrics to build metrics using the logs sent to Google Cloud Logging.
Alternatively, you could use the Monitoring client libraries to define metrics in your code and send them directly to Monitoring, separate from the logs.
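As a rough sketch of that alternative, the google-cloud-monitoring Python library can write a data point to a custom metric directly. The project ID, metric type, and label value below are placeholders you would replace with your own.

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_id = "my-project-id"  # placeholder
project_name = f"projects/{project_id}"

series = monitoring_v3.TimeSeries()
# The metric type under custom.googleapis.com/ is a name of your own choosing.
series.metric.type = "custom.googleapis.com/ads_api/request_duration_ms"
series.metric.labels["method"] = "GoogleAdsService.SearchStream"
series.resource.type = "global"

now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": 1043.0}}
)
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```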
Log-based metrics example
As an example of log-based metrics, we can set up a custom metric in Google Cloud Logging based on the logs sent in our Python example. Suppose we want to monitor the is_fault value to better understand error rates in our application. We can extract the is_fault value from our logged data into a new counter metric, ErrorCount.
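Log-based metrics are usually configured in the Cloud console, but as a rough sketch, the same counter metric could be created with the google-cloud-logging Python client. The filter below assumes the example logs is_fault in a structured (JSON) payload; adjust it to match your actual log entries.

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Assumes is_fault is written to a structured (JSON) payload by the interceptor.
error_count = client.metric(
    "ErrorCount",
    filter_="jsonPayload.is_fault=true",
    description="Count of Google Ads API requests that returned a fault",
)
if not error_count.exists():
    error_count.create()
```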
In Cloud Logging, labels let you group your metrics into categories based on other data in the logs.
In this example, we configure a label for the method field, which is also sent to Cloud Logging in our Python example. This allows us to look at how error count breaks down by the Google Ads API method.
With the ErrorCount metric and the Method label configured, we can create a new chart in a Monitoring dashboard to monitor ErrorCount, grouped by Method.
Alerting
In Cloud Monitoring and in other tools, you can also configure alerting policies that specify when and how alerts should be triggered by your metrics. For instructions on setting up Cloud Monitoring alerts, follow the alerting guide.
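The alerting guide covers setup in the Cloud console. As a rough, illustrative sketch, an alerting policy on the ErrorCount log-based metric could also be created programmatically with the Monitoring client library; the project ID, filter, threshold, and duration below are placeholder values to adapt to your own metric.

```python
from google.cloud import monitoring_v3

project_id = "my-project-id"  # placeholder
client = monitoring_v3.AlertPolicyServiceClient()

# Alert when the rate of the ErrorCount log-based metric exceeds a chosen
# threshold for five minutes. Values here are illustrative only.
policy = monitoring_v3.AlertPolicy(
    display_name="Google Ads API error rate too high",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="ErrorCount rate above threshold",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type = "logging.googleapis.com/user/ErrorCount" '
                    'AND resource.type = "global"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=5,
                duration={"seconds": 300},
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period={"seconds": 300},
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
                    )
                ],
            ),
        )
    ],
)

created = client.create_alert_policy(
    name=f"projects/{project_id}", alert_policy=policy
)
print(f"Created alert policy: {created.name}")
```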