Alerting & On-Call Guide

This guide covers how alerts work, how to manage them, and on-call procedures for the infrastructure team.

Alert Routing to PagerDuty

How It Works

  1. Prometheus evaluates alert rules and generates alerts
  2. Alertmanager receives alerts and routes them based on labels
  3. Alerts are sent to PagerDuty based on the project_name label
  4. PagerDuty pages the on-call engineer based on the configured schedule
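
For illustration, a minimal rule on the Prometheus side might look like the sketch below (the expression, duration, and label values are illustrative, not an existing production rule); the project_name label it attaches is what the routes in the next section match against:

- alert: InstanceDown
  expr: up == 0               # the built-in "up" metric is 0 when a scrape target is unreachable
  for: 5m
  labels:
    severity: critical
    project_name: "filecoin"  # Alertmanager routes on this label (steps 2-3 above)
  annotations:
    summary: "A scrape target has been down for 5 minutes"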

Alert Routing Configuration

The Alertmanager configuration routes alerts based on the project_name label in Prometheus metrics:

routes:
  - receiver: "pd-fil-infra-incidents-high"
    matchers:
      - project_name=~"filecoin"

  - receiver: "pd-dot-infra-incidents-high"
    matchers:
      - project_name=~"polkadot"

  # ... additional routes
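
Each route above points at a receiver defined elsewhere in the same configuration. A hedged sketch of the Filecoin receiver, following the pattern shown in Step 3 under Adding New Alerts and using the vault variable listed in the next section:

receivers:
  - name: "pd-fil-infra-incidents-high"
    pagerduty_configs:
      - routing_key: "{{ pd_fil_infra_incidents_key_high }}"  # Ansible Vault variable (see below)
        send_resolved: true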

PagerDuty Integration Keys

Each project has its own PagerDuty integration key stored as Ansible variables:

  • pd_fil_infra_incidents_key_high - Filecoin (used by infrastructure-general, infra-ansible, and fil-ansible-collection repositories)
  • pd_dot_infra_incidents_key_high - Polkadot
  • pd_ipfs_infra_incidents_key_high - IPFS
  • ethereum_infra_incidents_key_high - Ethereum
  • pd_canton_infra_incidents_key_high - Canton
  • pd_walletconnect_infra_incidents_high - WalletConnect
  • pd_zkverify_infra_incidents_high - ZKVerify

Note: These are stored in Ansible Vault and should be kept secure.

Filecoin Repository Note: The Filecoin project uses multiple separate Ansible repositories (infrastructure-general/ansible/filecoin-execution, infra-ansible, and fil-ansible-collection) due to historical infrastructure drift. All repositories must use the same PagerDuty integration key (pd_fil_infra_incidents_key_high) and set project_name: "filecoin" in Prometheus targets to ensure consistent alert routing. This multi-repository setup is temporary until infrastructure consolidation is complete.
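
For example, a Prometheus target entry in any of these repositories would carry the same label (a sketch following the format shown in Step 2 under Adding New Alerts; the job name, host, and port are placeholders):

- labels:
    project_name: "filecoin"      # must match the pd-fil-infra-incidents-high route
    job: my_filecoin_service      # placeholder job name
  targets:
    - "fil-node-1:9100"           # placeholder host:port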

Alert Severity

Currently, all production alerts route to "high priority" PagerDuty integrations. The Alertmanager configuration includes TODOs to:

  • Add low priority alert receivers
  • Refine receivers with alert severity labels

When adding new alerts, consider these severity levels:

  • critical: Immediate action required, service is down or severely degraded
  • warning: Action required soon, service may be impacted
  • info: Informational, no immediate action required
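
Once severity labels are in place, the routing refinement mentioned in the TODOs could look something like the following sketch (the low-priority receiver and its integration key do not exist yet; all names here are hypothetical):

routes:
  - receiver: "pd-myproject-infra-incidents-high"
    matchers:
      - project_name=~"my-project"
      - severity=~"critical"

  - receiver: "pd-myproject-infra-incidents-low"   # hypothetical low-priority receiver
    matchers:
      - project_name=~"my-project"
      - severity=~"warning|info"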

Managing Alerts

Viewing Active Alerts

  1. Alertmanager UI: https://alertmanager.chainsafe.dev

    • View all active alerts
    • See alert grouping and routing
    • Silence alerts temporarily
  2. Grafana Cloud: https://chainsafe.grafana.net/alerting

    • View Grafana-managed alerts
    • See alert history and evaluation status
  3. Prometheus: https://prometheus.chainsafe.dev/alerts

    • View all alert rules and their current state

Silencing Alerts

Using Alertmanager UI

  1. Navigate to https://alertmanager.chainsafe.dev
  2. Click "New Silence"
  3. Configure matchers (e.g., alertname=InstanceDown, instance=server-1)
  4. Set duration
  5. Add comment explaining why the alert is silenced
  6. Click "Create"

Using amtool (CLI)
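
# amtool targets the Alertmanager given by --alertmanager.url or its config file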

# Silence an alert
amtool silence add alertname=InstanceDown instance=server-1 --duration=1h --comment="Scheduled maintenance"

# List active silences
amtool silence query

# Expire a silence
amtool silence expire <silence-id>

Updating Alert Rules

Alert rules are defined in Prometheus configuration files:

  1. Self-hosted Prometheus:

    • Location: infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml
    • Update the template and run the Ansible playbook (the rules-file layout is sketched after this list)
  2. Grafana Cloud Alerts:

    • Location: infrastructure-general/terraform/grafana-cloud/
    • Update Terraform files and apply changes
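
For reference, prom_rules.yml follows the standard Prometheus rules-file layout; a hedged sketch (the group name, expression, and threshold are illustrative, not the actual production values):

groups:
  - name: node-alerts                # group name is illustrative
    rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
          project_name: "filecoin"   # determines Alertmanager routing
        annotations:
          summary: "Host has less than 10% memory available"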

On-Call Procedures

Receiving a Page

When you receive a PagerDuty page:

  1. Acknowledge the incident in PagerDuty
  2. Review the alert details:
    • Check the alert name and description
    • Review the runbook URL (if provided in alert annotations)
    • Check the Prometheus query URL to see current metrics
  3. Assess the severity:
    • Is the service actually down?
    • How many users/services are affected?
    • Is this a false positive?
  4. Take action:
    • Follow the runbook for the specific alert
    • Check logs in Grafana Cloud
    • Review metrics dashboards
    • Escalate if needed

Escalation Path

  1. Primary On-Call: First responder (you)
  2. Secondary On-Call: If primary is unavailable
  3. Team Lead: For critical issues requiring additional resources
  4. External Support: Cloud provider support, vendor support, etc.

Post-Incident

After resolving an incident:

  1. Resolve the incident in PagerDuty
  2. Document the incident:
    • What happened?
    • Root cause?
    • Resolution steps?
    • Prevention measures?
  3. Update runbooks if procedures changed
  4. Review alert thresholds if it was a false positive or missed alert

Common Alerts & Runbooks

Infrastructure Alerts

These alerts are documented in the General Infrastructure Runbook:

  • InstanceDown: Server is not operational
  • HostOutOfMemory: Host is running out of memory
  • HostDiskWillFillIn24Hours: Disk space is nearly full
  • HostOomKillDetected: Out-of-memory kill detected
  • HostRequiresReboot: System requires a reboot
  • PrometheusRuleEvaluationFailures: Prometheus rule evaluation errors

Project-Specific Alerts

Each project has its own runbook; see the project-specific documentation listed under Resources.

Adding New Alerts

Step 1: Define the Alert Rule

Add the alert to the Prometheus rules file:

- alert: MyNewAlert
  expr: my_metric > threshold
  for: 5m
  labels:
    severity: critical
    project_name: "my-project"
  annotations:
    summary: "Alert summary"
    description: "Detailed description"
    runbook: "https://infra-docs.chainsafe.dev/docs/..."

Step 2: Ensure Proper Labeling

Make sure your metrics include the project_name label so alerts route correctly:

- labels:
    scrape_location: my_service
    job: my_service
    instance: "server-1"
    project_name: "my-project"  # This determines routing
  targets:
    - "server-1:8080"

Step 3: Add Alertmanager Route (if needed)

If routing to a new PagerDuty integration, add a route in Alertmanager:

routes:
  - receiver: "pd-myproject-infra-incidents-high"
    matchers:
      - project_name=~"my-project"

receivers:
  - name: "pd-myproject-infra-incidents-high"
    pagerduty_configs:
      - routing_key: "{{ pd_myproject_infra_incidents_key_high }}"
        send_resolved: true

Step 4: Create/Update Runbook

Document the alert resolution steps in the appropriate runbook.

Step 5: Test the Alert

  1. Trigger the alert condition (safely)
  2. Verify it routes to the correct PagerDuty integration
  3. Confirm the alert details are clear
  4. Test the resolution steps
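
Rule changes can also be unit-tested offline with promtool before they are rolled out. A minimal sketch, assuming the Step 1 rule with its placeholder threshold replaced by a concrete value such as 100, and a test file named my_alert_tests.yml next to prom_rules.yml (both the threshold and the file name are illustrative); run it with promtool test rules my_alert_tests.yml:

rule_files:
  - prom_rules.yml                  # rules file under test, relative to this file
evaluation_interval: 1m
tests:
  - interval: 1m
    # feed ~10 minutes of samples where my_metric stays above the threshold
    input_series:
      - series: 'my_metric{instance="server-1", job="my_service", project_name="my-project"}'
        values: '150x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: MyNewAlert
        exp_alerts:
          - exp_labels:
              severity: critical
              project_name: "my-project"
              instance: server-1
              job: my_service
            exp_annotations:
              summary: "Alert summary"
              description: "Detailed description"
              runbook: "https://infra-docs.chainsafe.dev/docs/..."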

Best Practices

Alert Design

  • Be specific: Alert names and descriptions should clearly indicate what's wrong
  • Include context: Add labels and annotations that help diagnose the issue
  • Link to runbooks: Always include a runbook URL in alert annotations
  • Set appropriate thresholds: Avoid alert fatigue from false positives
  • Use an appropriate "for" duration: Don't alert on transient issues

On-Call Practices

  • Respond promptly: Acknowledge pages within SLA (typically 15 minutes)
  • Communicate: Update PagerDuty notes with your investigation progress
  • Document: Add notes about what you tried and what worked
  • Escalate early: Don't struggle alone if you're stuck
  • Follow runbooks: They exist for a reason

Alert Maintenance

  • Review regularly: Check for stale or unused alerts
  • Update thresholds: Adjust based on actual behavior
  • Remove false positives: Don't let noisy alerts desensitize the team
  • Add missing alerts: If incidents happen without alerts, add them

Troubleshooting

Alerts Not Firing

  1. Check if the alert rule is syntactically correct
  2. Verify the PromQL expression returns data
  3. Check whether the alert's "for" duration has elapsed
  4. Review Prometheus logs for rule evaluation errors

Alerts Not Routing to PagerDuty

  1. Verify the project_name label matches a route matcher
  2. Check Alertmanager logs for routing errors
  3. Verify PagerDuty integration key is correct
  4. Test PagerDuty integration connectivity

Too Many Alerts

  1. Review alert thresholds - may be too sensitive
  2. Check for duplicate alerts
  3. Consider grouping related alerts
  4. Review and silence known issues appropriately

Resources

  • Alertmanager UI: https://alertmanager.chainsafe.dev
  • Prometheus UI: https://prometheus.chainsafe.dev
  • Grafana Cloud: https://chainsafe.grafana.net
  • PagerDuty: https://chainsafe.pagerduty.com
  • Runbooks: See project-specific documentation