# Alerting & On-Call Guide
This guide covers how alerts work, how to manage them, and on-call procedures for the infrastructure team.
## Alert Routing to PagerDuty

### How It Works

- Prometheus evaluates alert rules and generates alerts
- Alertmanager receives alerts and routes them based on labels
- Alerts are sent to PagerDuty based on the `project_name` label
- PagerDuty pages the on-call engineer based on the configured schedule

### Alert Routing Configuration

The Alertmanager configuration routes alerts based on the `project_name` label in Prometheus metrics:
```yaml
routes:
  - receiver: "pd-fil-infra-incidents-high"
    matchers:
      - project_name=~"filecoin"
  - receiver: "pd-dot-infra-incidents-high"
    matchers:
      - project_name=~"polkadot"
  # ... additional routes
```
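To confirm which receiver a given label set will hit, `amtool` can evaluate the routing tree against a rendered configuration file. A minimal sketch, assuming you have the rendered `alertmanager.yml` available locally:

```bash
# Print the full routing tree
amtool config routes show --config.file=alertmanager.yml

# Show which receiver an alert with these labels would route to
amtool config routes test --config.file=alertmanager.yml project_name=filecoin
```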
### PagerDuty Integration Keys

Each project has its own PagerDuty integration key stored as Ansible variables:

- `pd_fil_infra_incidents_key_high` - Filecoin (used by the `infrastructure-general`, `infra-ansible`, and `fil-ansible-collection` repositories)
- `pd_dot_infra_incidents_key_high` - Polkadot
- `pd_ipfs_infra_incidents_key_high` - IPFS
- `ethereum_infra_incidents_key_high` - Ethereum
- `pd_canton_infra_incidents_key_high` - Canton
- `pd_walletconnect_infra_incidents_high` - WalletConnect
- `pd_zkverify_infra_incidents_high` - ZKVerify
**Note**: These keys are stored in Ansible Vault and must be kept secure.
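To inspect or rotate a key, use `ansible-vault` against the vaulted variables file. The path below is illustrative, not the actual repository layout:

```bash
# View the vaulted variables without writing plaintext to disk (path is illustrative)
ansible-vault view group_vars/all/vault.yml

# Edit in place; the file is re-encrypted on save
ansible-vault edit group_vars/all/vault.yml
```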
**Filecoin Repository Note**: The Filecoin project uses multiple separate Ansible repositories (`infrastructure-general/ansible/filecoin-execution`, `infra-ansible`, and `fil-ansible-collection`) due to historical infrastructure drift. All repositories must use the same PagerDuty integration key (`pd_fil_infra_incidents_key_high`) and set `project_name: "filecoin"` in Prometheus targets to ensure consistent alert routing. This multi-repository setup is temporary until infrastructure consolidation is complete.
## Alert Severity

Currently, all production alerts route to "high priority" PagerDuty integrations. The Alertmanager configuration includes TODOs to:

- Add low-priority alert receivers
- Refine receivers with alert severity labels
### Recommended Severity Levels

When adding new alerts, consider these severity levels (see the routing sketch after this list):

- `critical`: Immediate action required; the service is down or severely degraded
- `warning`: Action required soon; the service may be impacted
- `info`: Informational; no immediate action required
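Once the low-priority receivers mentioned in the TODOs above exist, routes can match on `severity` in addition to `project_name`. A sketch of what that could look like; the `pd-fil-infra-incidents-low` receiver is hypothetical and does not exist in the current configuration:

```yaml
routes:
  - receiver: "pd-fil-infra-incidents-high"
    matchers:
      - project_name=~"filecoin"
      - severity="critical"
  - receiver: "pd-fil-infra-incidents-low" # hypothetical low-priority receiver
    matchers:
      - project_name=~"filecoin"
      - severity=~"warning|info"
```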
## Managing Alerts

### Viewing Active Alerts

- **Alertmanager UI**: https://alertmanager.chainsafe.dev
  - View all active alerts
  - See alert grouping and routing
  - Silence alerts temporarily
- **Grafana Cloud**: https://chainsafe.grafana.net/alerting
  - View Grafana-managed alerts
  - See alert history and evaluation status
- **Prometheus**: https://prometheus.chainsafe.dev/alerts
  - View all alert rules and their current state
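The same active-alert view is available from the CLI with `amtool`:

```bash
# List all active alerts
amtool --alertmanager.url=https://alertmanager.chainsafe.dev alert query

# Filter to a specific alert name
amtool --alertmanager.url=https://alertmanager.chainsafe.dev alert query alertname=InstanceDown
```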
## Silencing Alerts

### Using Alertmanager UI

- Navigate to https://alertmanager.chainsafe.dev
- Click "New Silence"
- Configure matchers (e.g., `alertname=InstanceDown`, `instance=server-1`)
- Set a duration
- Add a comment explaining why the alert is silenced
- Click "Create"
### Using amtool (CLI)

```bash
# Silence an alert
amtool silence add alertname=InstanceDown instance=server-1 --duration=1h --comment="Scheduled maintenance"

# List active silences
amtool silence query

# Expire a silence
amtool silence expire <silence-id>
```
## Updating Alert Rules

Alert rules are defined in Prometheus configuration files:

- **Self-hosted Prometheus**:
  - Location: `infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml`
  - Update the template and run the Ansible playbook (see the sketch after this list)
- **Grafana Cloud Alerts**:
  - Location: `infrastructure-general/terraform/grafana-cloud/`
  - Update the Terraform files and apply the changes
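Rolling out either change looks roughly like this; the playbook and inventory names are illustrative, so check the repository for the actual ones:

```bash
# Self-hosted Prometheus: re-render the rules template via Ansible
# (playbook and inventory names are illustrative)
cd infrastructure-general/ansible/general
ansible-playbook -i inventory playbooks/prometheus.yml

# Grafana Cloud: apply the Terraform definitions (run from the repo root)
cd infrastructure-general/terraform/grafana-cloud
terraform plan
terraform apply
```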
## On-Call Procedures

### Receiving a Page

When you receive a PagerDuty page:

- Acknowledge the incident in PagerDuty
- Review the alert details:
  - Check the alert name and description
  - Review the runbook URL (if provided in alert annotations)
  - Check the Prometheus query URL to see current metrics
- Assess the severity (see the query sketch after this list):
  - Is the service actually down?
  - How many users/services are affected?
  - Is this a false positive?
- Take action:
  - Follow the runbook for the specific alert
  - Check logs in Grafana Cloud
  - Review metrics dashboards
  - Escalate if needed
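A quick first check for "is the service actually down?" is to query the target's scrape state in the Prometheus UI; the instance name is a placeholder:

```promql
# 1 = target is up, 0 = scrapes are failing (instance name is a placeholder)
up{instance="server-1"}
```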
### Escalation Path

- **Primary On-Call**: First responder (you)
- **Secondary On-Call**: Engaged if the primary is unavailable
- **Team Lead**: For critical issues requiring additional resources
- **External Support**: Cloud provider support, vendor support, etc.
### Post-Incident

After resolving an incident:

- Resolve the incident in PagerDuty
- Document the incident:
  - What happened?
  - Root cause?
  - Resolution steps?
  - Prevention measures?
- Update runbooks if procedures changed
- Review alert thresholds if it was a false positive or a missed alert
## Common Alerts & Runbooks

### Infrastructure Alerts

These alerts are documented in the General Infrastructure Runbook:

- `InstanceDown`: Server is not operational
- `HostOutOfMemory`: Host is running out of memory
- `HostDiskWillFillIn24Hours`: Disk is predicted to fill within 24 hours
- `HostOomKillDetected`: Out-of-memory kill detected
- `HostRequiresReboot`: System requires a reboot
- `PrometheusRuleEvaluationFailures`: Prometheus rule evaluation errors
### Project-Specific Alerts
Each project has its own runbook:
- Lodestar: Lodestar Runbook
- Filecoin: Filecoin Runbook
- Polkadot: Polkadot Runbook
- B3: B3 Runbook
## Adding New Alerts

### Step 1: Define the Alert Rule
Add the alert to the Prometheus rules file:
```yaml
- alert: MyNewAlert
  expr: my_metric > threshold
  for: 5m
  labels:
    severity: critical
    project_name: "my-project"
  annotations:
    summary: "Alert summary"
    description: "Detailed description"
    runbook: "https://infra-docs.chainsafe.dev/docs/..."
```
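Before deploying, validate the rules file with `promtool`; the filename matches the self-hosted template referenced earlier:

```bash
# Check syntax and structure of the Prometheus rules file
promtool check rules prom_rules.yml
```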
### Step 2: Ensure Proper Labeling

Make sure your metrics include the `project_name` label so alerts route correctly:
```yaml
- labels:
    scrape_location: my_service
    job: my_service
    instance: "server-1"
    project_name: "my-project" # This determines routing
  targets:
    - "server-1:8080"
```
### Step 3: Add Alertmanager Route (if needed)
If routing to a new PagerDuty integration, add a route in Alertmanager:
```yaml
routes:
  - receiver: "pd-myproject-infra-incidents-high"
    matchers:
      - project_name=~"my-project"

receivers:
  - name: "pd-myproject-infra-incidents-high"
    pagerduty_configs:
      - routing_key: "{{ pd_myproject_infra_incidents_key_high }}"
        send_resolved: true
```
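Validate the rendered configuration before reloading Alertmanager; assumes the rendered file is available locally:

```bash
# Validate the Alertmanager configuration file
amtool check-config alertmanager.yml
```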
### Step 4: Create/Update Runbook

Document the alert resolution steps in the appropriate runbook.

### Step 5: Test the Alert
- Trigger the alert condition (safely); a synthetic-alert sketch follows this list
- Verify it routes to the correct PagerDuty integration
- Confirm the alert details are clear
- Test the resolution steps
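One way to exercise routing without touching production services is to POST a synthetic alert to the Alertmanager v2 API; this assumes the API is reachable from your machine, the labels mirror the Step 1 example, and the alert auto-resolves after Alertmanager's resolve timeout:

```bash
# Push a synthetic alert to Alertmanager to exercise routing
curl -X POST https://alertmanager.chainsafe.dev/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {
          "alertname": "MyNewAlert",
          "severity": "critical",
          "project_name": "my-project"
        },
        "annotations": {"summary": "Routing test - please ignore"}
      }]'
```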
## Best Practices

### Alert Design

- **Be specific**: Alert names and descriptions should clearly indicate what's wrong
- **Include context**: Add labels and annotations that help diagnose the issue
- **Link to runbooks**: Always include a runbook URL in alert annotations
- **Set appropriate thresholds**: Avoid alert fatigue from false positives
- **Use an appropriate `for` duration**: Don't alert on transient issues
### On-Call Practices

- **Respond promptly**: Acknowledge pages within SLA (typically 15 minutes)
- **Communicate**: Update PagerDuty notes with your investigation progress
- **Document**: Add notes about what you tried and what worked
- **Escalate early**: Don't struggle alone if you're stuck
- **Follow runbooks**: They exist for a reason
### Alert Maintenance

- **Review regularly**: Check for stale or unused alerts
- **Update thresholds**: Adjust based on actual behavior
- **Remove false positives**: Don't let noisy alerts desensitize the team
- **Add missing alerts**: If incidents happen without alerts, add them
## Troubleshooting

### Alerts Not Firing

- Check that the alert rule is syntactically correct
- Verify the PromQL expression returns data
- Check whether the `for` duration has elapsed
- Review Prometheus logs for rule evaluation errors; a live rule-state check follows this list
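To see a rule's live evaluation state (`inactive`, `pending`, or `firing`), query the Prometheus rules API; this assumes `jq` is installed and reuses the hypothetical `MyNewAlert` name from Step 1:

```bash
# Inspect the live state of a specific alert rule
curl -s https://prometheus.chainsafe.dev/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.name == "MyNewAlert") | {name, state, health}'
```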
### Alerts Not Routing to PagerDuty

- Verify the `project_name` label matches a route matcher
- Check Alertmanager logs for routing errors
- Verify the PagerDuty integration key is correct
- Test PagerDuty integration connectivity
### Too Many Alerts

- Review alert thresholds - they may be too sensitive
- Check for duplicate alerts
- Consider grouping related alerts (see the sketch after this list)
- Review and silence known issues appropriately
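Grouping collapses related alerts into a single notification. A minimal sketch of the relevant Alertmanager route options; the values are illustrative, not the current production settings:

```yaml
route:
  group_by: ["alertname", "project_name"] # one notification per alert/project pair
  group_wait: 30s      # how long to wait before sending the first notification for a group
  group_interval: 5m   # how long to wait before notifying about new alerts in a group
  repeat_interval: 4h  # how often to re-notify for unresolved alerts
```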
## Resources

- **Alertmanager UI**: https://alertmanager.chainsafe.dev
- **Prometheus UI**: https://prometheus.chainsafe.dev
- **Grafana Cloud**: https://chainsafe.grafana.net
- **PagerDuty**: https://chainsafe.pagerduty.com
- **Runbooks**: See project-specific documentation