AzureRecipes

Summary

This page gives an overview of the services available for monitoring and analysis, their relationships, and best practices for their use in PaaS/Serverless architectures.

Overview

There are many services that collect and analyze runtime data from resources and allow you to gain insights. The documentation of these services is overwhelming and it is sometimes difficult to know what to use and configure in which situation. Many of these services date back to when IaaS was the primary cloud architecture and are still very focused on infrastructure aspects.

The figure above takes a fresh look at these services and focuses on their relationships. The main distinctions are:

A brief history of monitoring

Azure Resources automatically generate runtime information, which is stored inside the resource itself and analysed by Monitor automatically. On almost all resources you will find the section Monitoring and, within it, the topic Metrics, which integrates the global Monitor service (the most relevant metrics are typically also shown on the Overview page). This has the following aspects:

For more advanced usage, you need to forward this data using Diagnostic Settings to one or more of the following destinations:

Routing data to a Log Analytics Workspace brings the following benefits:

Additionally, Workbooks offer more advanced analysis possibilities. They provide in-depth analysis of the data for a particular scope or topic. A number of standard Workbooks are available, or you can create your own custom Workbooks. Alternatively, you can install Solutions from the Azure Marketplace, which mainly bring in additional Workbooks.

Azure Resources that include or execute conventional code use an alternative solution: Application Insights. This service runs on top of a Log Analytics Workspace and extends its storage and querying functionality with very useful and powerful features:

As the data store of an Application Insights instance is always a Log Analytics Workspace, the retention time has to be configured there. For cost optimization, features such as sampling or capping (a daily maximum volume) are available.
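To illustrate how sampling and a daily cap interact, the following Python sketch estimates the ingested volume from a raw telemetry volume. The function name and all figures are hypothetical and not part of any Azure SDK. It assumes a fixed sampling percentage, whereas real sampling behaviour (for example adaptive sampling) is more nuanced.

```python
def estimate_daily_ingestion_gb(
    raw_gb_per_day: float,
    sampling_percentage: float,
    daily_cap_gb: float,
) -> float:
    """Estimate the daily ingested volume in GB (illustrative only).

    Assumption: sampling keeps a fixed percentage of the raw telemetry,
    and the daily cap cuts off everything above the configured volume.
    """
    if not 0 < sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be in (0, 100]")
    if raw_gb_per_day < 0 or daily_cap_gb < 0:
        raise ValueError("volumes must not be negative")

    sampled_gb = raw_gb_per_day * sampling_percentage / 100
    # The cap limits what is stored; data above it would be dropped.
    return min(sampled_gb, daily_cap_gb)


# Hypothetical numbers: 20 GB/day raw telemetry, 25 % sampling, 10 GB cap.
print(estimate_daily_ingestion_gb(20, 25, 10))
```

If the estimate regularly reaches the cap, data is being lost at the end of the day, so lowering the sampling percentage is usually the gentler option.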

Application Insights can be integrated into almost any software running anywhere. Besides applications in Virtual Machines or Containers, this also includes e.g. Single Page Applications running in browsers. This approach makes it possible to create a central place for all insights from an application.

Best Practices

Log Analytics Workspace

Application Insights

Activity Log

Alerts

Standardized Alerting Strategy

Depending on who assumes operational responsibility, a suitable alerting strategy should be defined. It should be consistent across all applications in the organization and ensure that the correct parties are informed of important events or problem indications.

A complete template including deployment definitions for basic components like action groups can be found here: Guideline Alerting Strategy

Monitor the specified Service Level

Analyze the given Service Level Agreement or promote the definition of corresponding goals. For each relevant technical aspect, define:

Typical Monitoring Topics

The following list may help to identify critical aspects of an application that should be monitored with Alert Rules.

| Resource | Aspect | Purpose | Examples / References |
| --- | --- | --- | --- |
| Resource Group | User Activities | Especially for production environments, it may be valuable to get notified of any manual changes (e.g. to make sure they are properly reflected in documentation or deployment scripts). | Alert Rule (Bicep) |
| Function App | Duration | When running in the consumption plan, the duration is limited to 5 minutes (default) and can be extended to a maximum of 10 minutes. An alert on durations of more than e.g. 80% of that limit can help to detect issues early and thus avoid unhandled timeout failures in production. | Alert Rule (Bicep) |
| App Service | Server Errors | An unhandled exception leads to an HTTP 500 result on HTTP-triggered Functions or Web Apps. This is acceptable on pre-production systems, but should otherwise be analyzed to prevent the failure or to add appropriate error handling. | Alert Rule (Bicep) |
| App Service | Error-Rate | This may indicate systematic problems (e.g. configuration failures), typically after a deployment. | Alert Rule (Bicep) |
| App Service | CPU / Memory / Disk Usage | For Functions on a dedicated App Service Plan or Web Apps without configured auto-scaling, this should be monitored to prevent overload situations. | - |
| Application Insights | Smart Detection | The smart detection rules described above can now be migrated to regular alerts, which improves the options for processing them. | Alert Rule (Bicep) |
| Application Insights | Requests | Last execution > x time: for specific use cases this may provide a valid measure to detect failures. | KQL query to summarize a metric for the last workday |
| Application Insights | Availability Tests | As explained in the text above, this is a great feature to continuously observe endpoints. | Classic or Standard Availability Test with Alert Rule (Bicep) |
| Cognitive Search | Index Size | Depending on the plan used, the number of indexes and especially the available storage is very limited and may cause problems in production. Unfortunately, these metrics are not yet logged, so creating a regular metric- or log-based Alert Rule is not yet possible. | - |
| Data Factory | Pipeline Executions | Inform about automatically triggered but failed executions (e.g. of integration or backup jobs). | Alert Rule (Bicep) |
| Logic App | Executions | Inform about automatically triggered but failed executions (e.g. of integration or backup jobs). | Alert Rule (Bicep) |
| Service Bus | Dead Letter Queue | Inform about messages that were finally cancelled and sorted out. | Alert Rule (Bicep) |
| API Management | Capacity | API Management instances in non-Consumption plans need to be scaled manually (or with auto-scale rules). The Capacity metric is the appropriate information to evaluate scaling needs. | Alert Rule (Bicep) |
| SQL Database | DTU Model | Databases in the cost-efficient DTU model need to be scaled manually; the need for scaling can be determined using the DTU Percentage metric. | Alert Rule (Bicep) |
| Cosmos DB | Manual/Provisioned Throughput Model | For DB accounts that are not in the serverless or autoscale model, the provisioned capacity should be observed and adjusted when needed. The Normalized RU Consumption metric is the appropriate information to evaluate scaling needs. | Alert Rule (Bicep) |
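The Function App duration row suggests alerting at a fraction (e.g. 80%) of the configured timeout. As a minimal sketch, the threshold value for such an alert condition could be derived as follows. The function name is hypothetical, and converting to milliseconds is an assumption about the unit the chosen duration metric uses, so verify it for the metric you alert on.

```python
from datetime import timedelta


def duration_alert_threshold(timeout: timedelta, fraction: float = 0.8) -> timedelta:
    """Return the duration above which an alert should fire (illustrative)."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    return timeout * fraction


# Consumption plan: 5 minutes by default, extendable to at most 10 minutes.
default_threshold = duration_alert_threshold(timedelta(minutes=5))
extended_threshold = duration_alert_threshold(timedelta(minutes=10))

# Assumption: the alert condition expects the threshold in milliseconds.
default_threshold_ms = default_threshold // timedelta(milliseconds=1)
extended_threshold_ms = extended_threshold // timedelta(milliseconds=1)

print(default_threshold_ms, extended_threshold_ms)
```

Deriving the threshold from the configured timeout, rather than hard-coding it, keeps the alert meaningful when the timeout is changed later.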

A full-fledged Bicep module to deploy a selection of standard alert rules can be found here: Standard Alert Rules