Operational Monitoring
Compliance Info
Below, we map this engineering practice to the articles of the AI Act that benefit from following it.
Implementing an operational metrics solution will help you achieve compliance with the following requirements of the AI Act:
- Art. 13 (Transparency and Provision of Information to Deployers), in particular:
    - Art. 13(3)(e), since monitoring the operation of the system makes it possible to provide statistics about system resource usage
- Art. 14 (Human Oversight), in particular:
    - Art. 14(4)(e), since continuously monitoring the operation of the system helps to detect conditions that may require intervention
- Art. 15 (Accuracy, Robustness and Cybersecurity), in particular:
    - Art. 15(4) (robustness), since monitoring and alerting can help detect and mitigate potential robustness and availability issues
    - Art. 15(5) (cybersecurity), since monitoring is a crucial part of threat detection
- Art. 26 (Obligations of Deployers of High-Risk AI Systems), in particular:
    - Art. 26(5) (Monitoring of the AI system's operation by the deployer)
    - Art. 26(6) (Keeping of system logs by the deployer)
Motivation
Besides monitoring the performance of your machine learning models, it is also important to monitor the performance of the underlying technical infrastructure and services. This includes monitoring the health of the servers, databases, and other components that support your machine learning applications.
These operational metrics can give an indication of the overall health of the system and can help you identify potential issues before they become critical. This aligns with the AI Act obligations towards the robustness and cybersecurity of high-risk AI systems.
Implementation Notes
Relevant metrics to monitor include (see the instrumentation sketch after this list):
- Resource usage: CPU, GPU, memory, disk, network
- Service availability: uptime, error rates
- Latency: request/response times
- Throughput: requests per second
- Custom metrics: application-specific metrics (e.g., number of processed records, model inference times)
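As an illustration, the sketch below instruments a model-serving function with several of these metrics using the official Prometheus Python client (`prometheus_client`). The metric names and the `predict()` stub are assumptions made for this example; adapt them to your own service.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Hypothetical metric names for an inference service; adapt to your application.
REQUESTS_TOTAL = Counter(
    "inference_requests_total", "Total number of inference requests", ["status"]
)
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Model inference latency in seconds"
)
IN_FLIGHT = Gauge("inference_requests_in_flight", "Requests currently being processed")


def predict(payload):
    """Stand-in for the actual model call (assumption for this example)."""
    time.sleep(random.uniform(0.01, 0.1))
    return {"score": random.random()}


def handle_request(payload):
    """Serve one request while recording throughput, latency, and error metrics."""
    IN_FLIGHT.inc()
    try:
        with INFERENCE_LATENCY.time():  # records the request/response time
            result = predict(payload)
        REQUESTS_TOTAL.labels(status="success").inc()
        return result
    except Exception:
        REQUESTS_TOTAL.labels(status="error").inc()  # feeds an error-rate metric
        raise
    finally:
        IN_FLIGHT.dec()


if __name__ == "__main__":
    # Expose all registered metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        handle_request({"feature": 1.0})
```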
Although alerting is a crucial part of an operational monitoring solution, this page does not cover it in depth. The following activities can provide a starting point for implementing an alerting system (a minimal sketch follows the list):
- Determining the thresholds for the metrics
- Defining the escalation process for alerts
- Setting up the alerting channels (e.g., email, Slack, PagerDuty)
- Deploying and setting up the alerting system
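To make these activities concrete, the sketch below polls Prometheus's HTTP query API for the error rate of the service instrumented above and posts a message to a generic webhook when a threshold is exceeded. The Prometheus address, webhook URL, PromQL query, and threshold are assumptions for this example; in most deployments this logic would live in Prometheus Alertmanager or Grafana Alerting rather than in custom code.

```python
import requests

# Assumptions for this sketch: a Prometheus server at localhost:9090 scraping the
# service above, and a generic incoming-webhook URL as the alerting channel.
PROMETHEUS_URL = "http://localhost:9090/api/v1/query"
WEBHOOK_URL = "https://example.com/alert-webhook"  # e.g., a Slack or Teams webhook
ERROR_RATE_THRESHOLD = 0.05  # escalate if more than 5% of requests fail

# PromQL: share of failed inference requests over the last 5 minutes.
QUERY = (
    'sum(rate(inference_requests_total{status="error"}[5m]))'
    " / sum(rate(inference_requests_total[5m]))"
)


def check_error_rate() -> None:
    """Query Prometheus and send an alert if the error rate exceeds the threshold."""
    response = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=10)
    response.raise_for_status()
    results = response.json()["data"]["result"]
    if not results:
        return  # no data points yet, nothing to evaluate
    error_rate = float(results[0]["value"][1])
    if error_rate > ERROR_RATE_THRESHOLD:
        requests.post(
            WEBHOOK_URL,
            json={"text": f"Inference error rate is {error_rate:.1%} (threshold: 5%)"},
            timeout=10,
        )


if __name__ == "__main__":
    check_error_rate()  # typically run on a schedule, e.g., every minute
```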
Key Technologies
Metrics Collection and Visualization
- Prometheus, a time-series database for event and metrics collection, storage, monitoring, and alerting
    - Many tools and frameworks can expose their operational metrics in the Prometheus format
- Grafana, an open-source analytics solution for visualization of metrics
- The ELK (Elasticsearch, Logstash, Kibana) stack, in particular:
    - Elasticsearch, a distributed search and analytics engine
    - Kibana, a data visualization and exploration tool for Elasticsearch
Alerting
- Prometheus Alertmanager
- Grafana Alerting