This page gives an overview of the services available for monitoring and analysis, their relationships, and best practices for their use in PaaS/Serverless architectures.

Many services collect and analyze runtime data from resources and allow you to gain insights. Their documentation is overwhelming, and it is sometimes difficult to know what to use and configure in which situation. Many of these services date back to when IaaS was the primary cloud architecture and are still very focused on infrastructure aspects.
The figure above takes a fresh look at these services and focuses on their relationships. The main distinctions are the following:
Azure resources automatically generate runtime information, which is stored inside the resource itself and analyzed by Monitor automatically. On almost all resources you find the section Monitoring and, within it, the topic Metrics with the integration of the global Monitor service (the most relevant metrics are typically also made available on the Overview page). These metrics can be pinned to Dashboards, which provide an almost real-time overview of one or multiple resources.

For more advanced usage you need to forward the data using Diagnostic Settings to one or more targets. Routing the data to a Log Analytics Workspace brings additional benefits, such as configurable data retention and advanced querying (e.g. with KQL).
Additionally, there are more advanced analysis possibilities using Workbooks. These provide in-depth analysis of the data for a particular scope or topic. There are a number of standard Workbooks, or you may create custom Workbooks of your own. Alternatively, you can install Solutions from the Azure Marketplace, which mainly bring in additional Workbooks.
Azure resources which include or execute conventional code use an alternative solution: Application Insights. This service runs on top of a Log Analytics Workspace and extends its storage and querying functionality with very useful and powerful features:

- Application Insights uses context identifiers such as user or device and builds powerful views comparable to what e.g. Google Analytics offers.
- Application Insights includes a minimum of 90 days of data retention without additional cost (the regular retention period of a Log Analytics Workspace is just 30 days).
- As the data store of an Application Insights instance is always a Log Analytics Workspace, the retention time has to be configured there. For cost optimization, features such as sampling or capping (a daily maximum volume) are available.
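The sampling feature mentioned above reduces cost by keeping only a fraction of the telemetry. Fixed-rate sampling makes a deterministic keep/drop decision per operation, so all telemetry items of one operation are kept or dropped together. The following is a minimal sketch of that idea; the hashing scheme is illustrative, not the SDK's actual algorithm:

```python
import hashlib

def should_sample(operation_id: str, sampling_percentage: float) -> bool:
    """Deterministic keep/drop decision: the same operation id always
    yields the same result, so related telemetry stays together."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    # Map the hash onto [0, 100) and keep the item if it falls below the rate.
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < sampling_percentage

# All items of one operation share the same decision:
assert should_sample("op-123", 25.0) == should_sample("op-123", 25.0)
```

Because the decision is derived from the operation id rather than a random draw, distributed traces stay complete even when only a percentage of operations is retained.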
Application Insights may be integrated by almost any software running anywhere. Besides applications in virtual machines or containers, this also includes e.g. single-page applications running in browsers. With such an approach it is possible to create a central place for any insights from an application.
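Attaching context identifiers (user, device, session) to every tracked event is what enables the cross-cutting views described above. A minimal stand-in sketch of that pattern follows; the class is hypothetical and only mimics the `TelemetryClient`-style APIs the real SDKs provide:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TelemetryEvent:
    name: str
    properties: Dict[str, str]

@dataclass
class MiniTelemetryClient:
    """Hypothetical stand-in for a telemetry client: context set once
    (user/device/session) is attached to every tracked event."""
    context: Dict[str, str] = field(default_factory=dict)
    events: List[TelemetryEvent] = field(default_factory=list)

    def track_event(self, name: str,
                    properties: Optional[Dict[str, str]] = None) -> None:
        # Event-specific properties override the shared context on conflict.
        merged = {**self.context, **(properties or {})}
        self.events.append(TelemetryEvent(name, merged))

client = MiniTelemetryClient(context={"user_id": "u-42", "device": "browser"})
client.track_event("OrderPlaced", {"order_id": "o-1"})
```

Setting the context once per client instance, instead of per event, is what makes session and user correlation reliable across all telemetry of one application component.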
Best practices:

- Use Application Insights wherever possible (mainly for App Service based resources and API Management instances).
- Share an instance of Application Insights between the components of the same application or, for bigger architectures, of an isolated part of the application.
- Provide a health endpoint (e.g. `/api/status`) with an implementation of a connectivity check to all dependencies (e.g. database, bus, storage, …). Include only dependencies which are part of your SLA (external systems may not be).
- Use the `TelemetryClient` and track events with context information. Documentation for Azure Functions: https://docs.microsoft.com/en-us/azure/azure-functions/functions-monitoring?tabs=cmd#log-custom-telemetry-in-c-functions
- Connect single-page applications to the same Application Insights instance as the backend services. If the application has a user login, also attach a user identifier to the setup to improve the accuracy of session correlation. Documentation: https://learn.microsoft.com/en-us/azure/azure-monitor/app/javascript
- Log Analytics Workspace: snippet for ARM-based deployment - this enables an easy increase of the data retention.

Depending on who assumes operational responsibility, a suitable alerting strategy should be defined. It should be consistent for all applications in the organization and ensure that the correct parties are informed about important events or problem indications.
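The health-endpoint practice described above can be sketched as follows. The dependency checks are illustrative stand-ins; a real implementation would actually open a connection to the database, bus, storage, and so on:

```python
from typing import Callable, Dict, Tuple

def status_endpoint(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Runs a connectivity check per SLA-relevant dependency and returns
    an HTTP-like status: 200 if all pass, 503 otherwise."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            # A dependency that cannot even be reached counts as failed.
            results[name] = "failed"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results

# Illustrative checks; only dependencies covered by your SLA belong here.
code, body = status_endpoint({"database": lambda: True, "bus": lambda: True})
```

Returning a non-2xx status on any failed dependency is what makes the endpoint usable as the target of an availability test, which only needs to evaluate the HTTP status code.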
A complete template including deployment definitions for basic components like action groups can be found here: Guideline Alerting Strategy
Analyze the given Service Level Agreement or promote the definition of corresponding goals. For each relevant technical aspect, define:
The following list may help to identify critical aspects of an application for monitoring with Alert Rules.
| Resource | Aspect | Purpose | Examples / References |
|---|---|---|---|
| Resource Group | User Activities | Especially for productive environments it may be valuable to get notified of any manual changes (e.g. to make sure they are properly reflected in documentation or deployment scripts). | Alert Rule (Bicep) |
| Function App | Duration | If running in a Consumption plan, the duration is limited to 5 minutes (default) and can be extended to a maximum of 10 minutes. An alert on durations of more than e.g. 80% of the limit helps to detect issues early and thus avoids unhandled timeout failures in production. | Alert Rule (Bicep) |
| App Service | Server Errors | An unhandled exception leads to an HTTP 500 result on HTTP-triggered Functions or Web Apps. This is okay on pre-production systems but should be analyzed to prevent the failure or to add appropriate error handling. | Alert Rule (Bicep) |
| App Service | Error Rate | This may indicate systematic problems (e.g. configuration failures), typically after a deployment. | Alert Rule (Bicep) |
| App Service | CPU / Memory / Disk Usage | For Functions on a dedicated App Service Plan or Web Apps without configured auto-scaling, this should be monitored to prevent overload situations. | - |
| Application Insights | Smart Detection | The smart detection rules described above can now be migrated to regular alerts, which improves the capabilities for processing. | Alert Rule (Bicep) |
| Application Insights | Requests | Last execution more than x time ago: for specific use cases this may provide a valid measure to detect failures. | KQL query to summarize a metric for the last workday |
| Application Insights | Availability Tests | As explained in the text above, this is a great feature to continuously observe endpoints. | Classic or Standard Availability Test with Alert Rule (Bicep) |
| Cognitive Search | Index Size | Depending on the plan used, the number of indexes and especially the available storage is very limited and may cause problems in production. Unfortunately, these metrics are not yet logged - creating a regular metric- or log-based Alert Rule is not yet possible. | - |
| Data Factory | Pipeline Executions | Inform about automatically triggered but failed executions (e.g. of integration or backup jobs). | Alert Rule (Bicep) |
| Logic App | Executions | Inform about automatically triggered but failed executions (e.g. of integration or backup jobs). | Alert Rule (Bicep) |
| Service Bus | Dead Letter Queue | Inform about messages that were finally cancelled and moved to the dead-letter queue. | Alert Rule (Bicep) |
| API Management | Capacity | API Management in non-Consumption plans needs to be scaled manually (or with auto-scale rules). The Capacity metric is the appropriate information to evaluate scaling needs. | Alert Rule (Bicep) |
| SQL Database | DTU Model | Databases in the cost-efficient DTU model need to be scaled manually, which can be determined using the DTU Percentage metric. | Alert Rule (Bicep) |
| Cosmos DB | Manual/Provisioned Throughput Model | For DB accounts not being in the serverless or autoscale model, the provisioned capacity should be observed and adjusted when needed. The Normalized RU Consumption metric is the appropriate information to evaluate scaling needs. | Alert Rule (Bicep) |
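Conceptually, the metric-based alert rules referenced in the table evaluate an aggregation of a metric over a time window against a threshold. The following is a minimal sketch of that evaluation logic; the function name and parameters are illustrative, not Azure's API:

```python
from statistics import mean
from typing import List

def alert_fires(samples: List[float], threshold: float,
                aggregation: str = "Average") -> bool:
    """Evaluate one window of metric samples, analogous to a metric
    alert rule with operator 'GreaterThan'."""
    if not samples:
        # No data in the window: treated here as "do not fire".
        return False
    value = max(samples) if aggregation == "Maximum" else mean(samples)
    return value > threshold

# e.g. DTU Percentage averaged over the window vs. an 80% threshold:
assert alert_fires([70.0, 85.0, 95.0], threshold=80.0)  # mean ≈ 83.3 > 80
```

Choosing the aggregation matters: an Average rule smooths out short spikes, while a Maximum rule reacts to any single sample above the threshold within the window.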
A full-fledged Bicep module to deploy a selection of standard alert rules can be found here: Standard Alert Rules