Operational Monitoring
Compliance Info
Below, we map this engineering practice to the articles of the AI Act that benefit from following it.
Implementing an operational metrics solution will help you achieve compliance with the following requirements of the AI Act:
- Art. 13 (Transparency and Provision of Information to Deployers), in particular:
    - Art. 13(3)(e), since monitoring the operation of the system makes it possible to provide statistics about system resource usage
- Art. 14 (Human Oversight), in particular:
    - Art. 14(4)(e), since continuously monitoring the operation of the system helps to detect conditions that may require intervention
- Art. 15 (Accuracy, Robustness and Cybersecurity), in particular:
    - Art. 15(4) (robustness), since monitoring and alerting can help detect and mitigate potential robustness and availability issues
    - Art. 15(5) (cybersecurity), since monitoring is a crucial part of threat detection
- Art. 26 (Obligations of Deployers of High-Risk AI Systems), in particular:
    - Art. 26(5) (Monitoring of the AI system's operation by the deployer)
    - Art. 26(6) (Keeping of system logs by the deployer)
Motivation
Besides monitoring the performance of your machine learning models, it is also important to monitor the performance of the underlying technical infrastructure and services. This includes monitoring the health of the servers, databases, and other components that support your machine learning applications.
These operational metrics can give an indication of the overall health of the system and can help you identify potential issues before they become critical. This aligns with the AI Act obligations towards the robustness and cybersecurity of high-risk AI systems.
Implementation Notes
Relevant metrics to monitor include (see the instrumentation sketch after this list):
- Resource usage: CPU, GPU, memory, disk, network
- Service availability: uptime, error rates
- Latency: request/response times
- Throughput: requests per second
- Custom metrics: application-specific metrics (e.g., number of processed records, model inference times)
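As an illustration, the sketch below instruments a model-serving function with several of these metrics using the official Prometheus Python client (`prometheus_client`). The metric names and the `predict()` stub are assumptions made for this example; adapt them to your own service.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Hypothetical metric names for an inference service; adapt to your application.
REQUESTS_TOTAL = Counter(
    "inference_requests_total", "Total number of inference requests", ["status"]
)
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Model inference latency in seconds"
)
IN_FLIGHT = Gauge("inference_requests_in_flight", "Requests currently being processed")


def predict(payload):
    """Stand-in for the actual model call (assumption for this example)."""
    time.sleep(random.uniform(0.01, 0.1))
    return {"score": random.random()}


def handle_request(payload):
    """Serve one request while recording throughput, latency, and error metrics."""
    IN_FLIGHT.inc()
    try:
        with INFERENCE_LATENCY.time():  # records the request/response time
            result = predict(payload)
        REQUESTS_TOTAL.labels(status="success").inc()
        return result
    except Exception:
        REQUESTS_TOTAL.labels(status="error").inc()  # feeds an error-rate metric
        raise
    finally:
        IN_FLIGHT.dec()


if __name__ == "__main__":
    # Expose all registered metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        handle_request({"feature": 1.0})
```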
Although alerting is a crucial part of an operational monitoring solution, this page does not cover it in depth. The following activities can provide a starting point for implementing an alerting system (a minimal sketch follows the list):
- Determining the thresholds for the metrics
- Defining the escalation process for alerts
- Setting up the alerting channels (e.g., email, Slack, PagerDuty)
- Deploying and setting up the alerting system
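To make these activities concrete, the sketch below polls Prometheus's HTTP query API for the error rate of the service instrumented above and posts a message to a generic webhook when a threshold is exceeded. The Prometheus address, webhook URL, PromQL query, and threshold are assumptions for this example; in most deployments this logic would live in Prometheus Alertmanager or Grafana Alerting rather than in custom code.

```python
import requests

# Assumptions for this sketch: a Prometheus server at localhost:9090 scraping the
# service above, and a generic incoming-webhook URL as the alerting channel.
PROMETHEUS_URL = "http://localhost:9090/api/v1/query"
WEBHOOK_URL = "https://example.com/alert-webhook"  # e.g., a Slack or Teams webhook
ERROR_RATE_THRESHOLD = 0.05  # escalate if more than 5% of requests fail

# PromQL: share of failed inference requests over the last 5 minutes.
QUERY = (
    'sum(rate(inference_requests_total{status="error"}[5m]))'
    " / sum(rate(inference_requests_total[5m]))"
)


def check_error_rate() -> None:
    """Query Prometheus and send an alert if the error rate exceeds the threshold."""
    response = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=10)
    response.raise_for_status()
    results = response.json()["data"]["result"]
    if not results:
        return  # no data points yet, nothing to evaluate
    error_rate = float(results[0]["value"][1])
    if error_rate > ERROR_RATE_THRESHOLD:
        requests.post(
            WEBHOOK_URL,
            json={"text": f"Inference error rate is {error_rate:.1%} (threshold: 5%)"},
            timeout=10,
        )


if __name__ == "__main__":
    check_error_rate()  # typically run on a schedule, e.g., every minute
```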
Key Technologies
Metrics Collection and Visualization
- Prometheus, a time-series database for event and metrics collection, storage, monitoring, and alerting
    - Many tools and frameworks can expose their operational metrics in the Prometheus format
- Grafana, an open-source analytics solution for visualization of metrics
- The ELK (Elasticsearch, Logstash, Kibana) stack, in particular:
    - Elasticsearch, a distributed search and analytics engine
    - Kibana, a data visualization and exploration tool for Elasticsearch
Alerting
- Prometheus Alertmanager
- Grafana Alerting