Monitoring & Alerting Overview
This document provides a comprehensive overview of the monitoring and alerting infrastructure at ChainSafe. It covers the architecture, components, alert routing, and on-call procedures.
Architecture
Our monitoring stack consists of multiple layers:
┌────────────────────────────────────────────────────────┐
│                  Production Services                   │
│   (Filecoin, Polkadot, Ethereum, IPFS, Canton, etc.)   │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│                   Metrics Collection                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Node Exporter│  │   Promtail   │  │ Grafana Alloy│  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│                 Metrics & Logs Storage                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  Prometheus  │  │     Loki     │  │ Grafana Cloud│  │
│  │ (self-hosted)│  │ (self-hosted)│  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│                        Alerting                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  Prometheus  │  │ Alertmanager │  │  PagerDuty   │  │
│  │ Alert Rules  │  │              │  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│                     Visualization                      │
│  ┌──────────────┐  ┌──────────────┐                    │
│  │ Grafana Cloud│  │ Grafana Cloud│                    │
│  │ (Dashboards) │  │   (Alerts)   │                    │
│  └──────────────┘  └──────────────┘                    │
└────────────────────────────────────────────────────────┘
Components
Self-Hosted Infrastructure
Prometheus
- Location: Self-hosted on infrastructure metrics server
- Purpose: Metrics collection and storage
- Retention: 30 days
- URL: https://prometheus.chainsafe.dev
- Documentation: See Metrics Documentation
Loki
- Location: Self-hosted on infrastructure metrics server
- Purpose: Log aggregation and storage
- Retention: 30 days
- URL: https://loki.chainsafe.dev
- Documentation: See Logging Documentation
Alertmanager
- Location: Self-hosted on infrastructure metrics server
- Purpose: Alert routing and notification management
- URL: https://alertmanager.chainsafe.dev
- Configuration: Routes alerts to PagerDuty based on project labels
Grafana Cloud
Grafana Cloud Stack
- Primary Stack: chainsafe.grafana.net (US)
- Secondary Stack: chainsafe02.grafana.net (EU)
- Purpose: Visualization, dashboards, and additional alerting
- Access: GitHub OAuth authentication
- Documentation: See individual observability docs
Alert Routing
Alerts are routed based on the project_name label in Prometheus metrics:
| Project Name | Routing Destination | Integration Type | Notes |
|---|---|---|---|
| filecoin | PagerDuty - Filecoin High Priority | PagerDuty Integration Key | Routes alerts from the infrastructure-general, infra-ansible, and fil-ansible-collection repositories |
| polkadot | PagerDuty - Polkadot High Priority | PagerDuty Integration Key | |
| ethereum | PagerDuty - Ethereum High Priority | PagerDuty Integration Key | |
| ipfs | PagerDuty - IPFS High Priority | PagerDuty Integration Key | |
| canton | PagerDuty - Canton High Priority | PagerDuty Integration Key | |
| wallet-connect | PagerDuty - WalletConnect High Priority | PagerDuty Integration Key | |
| zkverify | PagerDuty - ZKVerify High Priority | PagerDuty Integration Key | |
| celestia | Alert routing needs to be configured | To be added | Currently using project_name: "celestia" but no Alertmanager route exists |
| aztec | Alert routing needs to be configured | To be added | Currently using project_name: "aztec" but no Alertmanager route exists |
| forest-staging | Slack - forest-infra-staging channel | Slack Webhook | |
| misc, monad | Slack - misc-infra-incidents channel | Slack Webhook | |
Note on Filecoin Infrastructure: The Filecoin project is currently deployed from multiple repositories (infrastructure-general, infra-ansible, and fil-ansible-collection) due to historical infrastructure drift. All repositories use project_name: "filecoin" in their Prometheus targets, so all Filecoin alerts route to the same PagerDuty integration regardless of source repository. Plans are in place to consolidate back to a single source of truth.
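The routing table above can be modeled as a simple label lookup. The following is a minimal Python sketch for illustration only: the real routing lives in the Alertmanager configuration, and the `ROUTES` dictionary and `route_alert` function are hypothetical names, not part of any ChainSafe repository. Projects with no configured route (celestia, aztec) are deliberately absent, so the lookup returns `None` for them; what Alertmanager actually does with such alerts depends on its root receiver, which this document does not describe.

```python
from typing import Optional, Tuple

# Hypothetical in-memory copy of the routing table in this document.
# Keys are values of the project_name label; values are
# (integration type, destination).
ROUTES = {
    "filecoin": ("pagerduty", "PagerDuty - Filecoin High Priority"),
    "polkadot": ("pagerduty", "PagerDuty - Polkadot High Priority"),
    "ethereum": ("pagerduty", "PagerDuty - Ethereum High Priority"),
    "ipfs": ("pagerduty", "PagerDuty - IPFS High Priority"),
    "canton": ("pagerduty", "PagerDuty - Canton High Priority"),
    "wallet-connect": ("pagerduty", "PagerDuty - WalletConnect High Priority"),
    "zkverify": ("pagerduty", "PagerDuty - ZKVerify High Priority"),
    "forest-staging": ("slack", "forest-infra-staging"),
    "misc": ("slack", "misc-infra-incidents"),
    "monad": ("slack", "misc-infra-incidents"),
}


def route_alert(labels: dict) -> Optional[Tuple[str, str]]:
    """Return (integration, destination) for an alert's labels,
    or None when no route is listed for its project_name."""
    return ROUTES.get(labels.get("project_name", ""))


if __name__ == "__main__":
    print(route_alert({"project_name": "filecoin"}))
    print(route_alert({"project_name": "celestia"}))  # None: no route listed
```

A lookup like this can be handy in a CI check that flags any project_name used in Prometheus targets but missing from the table.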
Alert Severity Levels
Currently, all production alerts route to "high priority" PagerDuty integrations. There are TODOs in the Alertmanager configuration to:
- Add low priority alert receivers
- Refine receivers with alert severity labels
Alert Flow
Prometheus/Alertmanager Flow
- Prometheus evaluates alert rules based on metrics
- Alertmanager receives alerts and routes them based on labels
- PagerDuty receives critical alerts and pages on-call engineers
- Slack receives staging/non-critical alerts for visibility
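One way to check this flow end to end is to hand Alertmanager a synthetic alert and watch where it lands. The sketch below builds such a request in Python against the Alertmanager v2 API (`POST /api/v2/alerts`) without sending it. It assumes the API is reachable at the Alertmanager URL listed in this document and needs no extra authentication, neither of which is confirmed here; the alert name `RoutingSmokeTest` and the function `build_test_alert_request` are made up for the example.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# URL taken from this document; API exposure and auth are assumptions.
ALERTMANAGER_URL = "https://alertmanager.chainsafe.dev"


def build_test_alert_request(
    project_name: str,
    base_url: str = ALERTMANAGER_URL,
    minutes: int = 5,
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one synthetic alert.

    The alert carries the project_name label that routing is based on
    and expires after `minutes` so it resolves on its own.
    """
    now = datetime.now(timezone.utc)
    payload = [
        {
            "labels": {
                "alertname": "RoutingSmokeTest",  # hypothetical name
                "project_name": project_name,
            },
            "annotations": {"summary": "Synthetic alert to verify routing"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]
    return urllib.request.Request(
        url=f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Use a staging project so the test lands in Slack, not on a pager.
    req = build_test_alert_request("forest-staging")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Sending a synthetic alert with a production project_name would page the on-call engineer, so a staging label such as forest-staging is the safer choice for a smoke test.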
Grafana Cloud Alert Flow
- Grafana Cloud evaluates alert rules based on metrics from Grafana Cloud Prometheus
- Grafana Alerting routes alerts directly to PagerDuty contact points
- PagerDuty receives critical alerts and pages on-call engineers
Note: Filecoin snapshot alerts are managed via Grafana Cloud (configured in infrastructure-general/terraform/grafana-cloud/filecoin.tf) and route to the same PagerDuty integration as Prometheus alerts, ensuring consistent alerting coverage regardless of alert source.
Key Files
Alertmanager Configuration
- Ansible Collection: ansible-collection/roles/alertmanager/templates/alertmanager.yml
- Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/alertmanager_rules.yml
Prometheus Alert Rules
- Ansible Collection: ansible-collection/roles/prometheus/templates/rules.yml
- Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml
Grafana Cloud Alerts
- Terraform: infrastructure-general/terraform/grafana-cloud/
- Filecoin Snapshot Alerts: Active Grafana Cloud alerts monitor Filecoin snapshot services deployed from both the infrastructure-general and infra-ansible repositories:
  - FilecoinSnapshotAgeOld - Alerts when the latest snapshot is older than 120 minutes
  - FilecoinOrphanArchiveFile - Alerts when orphan files are detected in the snapshot archive
  - FilecoinSnapshotNoUpload - Alerts when no snapshots have been uploaded in 2 hours
  - These alerts route to the same PagerDuty integration (pd-fil-infra-incidents-high) as Prometheus alerts
  - Configuration: infrastructure-general/terraform/grafana-cloud/filecoin.tf
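The three snapshot conditions can be read as simple threshold checks. The Python sketch below illustrates that logic only: the real rules are Grafana Cloud alerts defined in filecoin.tf, whose queries and evaluation windows are not shown in this document. The function name, its parameters, and the choice of strict comparisons (`>`) are assumptions made for the example.

```python
from typing import List

# Thresholds as stated in this document.
SNAPSHOT_MAX_AGE_MINUTES = 120  # FilecoinSnapshotAgeOld
UPLOAD_MAX_GAP_MINUTES = 120    # FilecoinSnapshotNoUpload (2 hours)


def firing_snapshot_alerts(
    snapshot_age_minutes: float,
    minutes_since_upload: float,
    orphan_files: int,
) -> List[str]:
    """Return the names of the snapshot alerts that would fire
    for the given measurements (illustrative logic only)."""
    firing = []
    if snapshot_age_minutes > SNAPSHOT_MAX_AGE_MINUTES:
        firing.append("FilecoinSnapshotAgeOld")
    if orphan_files > 0:
        firing.append("FilecoinOrphanArchiveFile")
    if minutes_since_upload > UPLOAD_MAX_GAP_MINUTES:
        firing.append("FilecoinSnapshotNoUpload")
    return firing


if __name__ == "__main__":
    # Healthy service: fresh snapshot, recent upload, no orphans.
    print(firing_snapshot_alerts(30, 30, 0))
    # Stalled service with leftover files in the archive.
    print(firing_snapshot_alerts(150, 150, 2))
```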
Related Documentation
- Logging - How to send logs to Loki
- Metrics - How to send metrics to Prometheus
- Tracing - How to send traces to Grafana Cloud
- Profiling - How to set up continuous profiling
- Alerting & On-Call Guide - Detailed alerting procedures and on-call runbook