Alerting & On-Call Guide

This guide covers how alerts work, how to manage them, and on-call procedures for the infrastructure team.

Alert Routing to PagerDuty

How It Works

  1. Prometheus evaluates alert rules and generates alerts
  2. Alertmanager receives alerts and routes them based on labels
  3. Alerts are sent to PagerDuty based on the project_name label
  4. PagerDuty pages the on-call engineer based on the configured schedule
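
For illustration, a minimal rule on the Prometheus side might look like the sketch below (the expression, duration, and label values are illustrative, not an existing production rule); the project_name label it attaches is what the routes in the next section match against:

- alert: InstanceDown
  expr: up == 0               # the built-in "up" metric is 0 when a scrape target is unreachable
  for: 5m
  labels:
    severity: critical
    project_name: "filecoin"  # Alertmanager routes on this label (steps 2-3 above)
  annotations:
    summary: "A scrape target has been down for 5 minutes"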

Alert Routing Configuration

The Alertmanager configuration routes alerts based on the project_name label in Prometheus metrics:

routes:
  - receiver: "pd-fil-infra-incidents-high"
    matchers:
      - project_name=~"filecoin"

  - receiver: "pd-dot-infra-incidents-high"
    matchers:
      - project_name=~"polkadot"

  # ... additional routes
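
Each route above points at a receiver defined elsewhere in the same configuration. A hedged sketch of the Filecoin receiver, following the pattern shown in Step 3 under Adding New Alerts and using the vault variable listed in the next section:

receivers:
  - name: "pd-fil-infra-incidents-high"
    pagerduty_configs:
      - routing_key: "{{ pd_fil_infra_incidents_key_high }}"  # Ansible Vault variable (see below)
        send_resolved: true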

PagerDuty Integration Keys

Each project has its own PagerDuty integration key stored as Ansible variables:

  • pd_fil_infra_incidents_key_high - Filecoin (used by infrastructure-general, infra-ansible, and fil-ansible-collection repositories)
  • pd_dot_infra_incidents_key_high - Polkadot
  • pd_ipfs_infra_incidents_key_high - IPFS
  • ethereum_infra_incidents_key_high - Ethereum
  • pd_canton_infra_incidents_key_high - Canton
  • pd_walletconnect_infra_incidents_high - WalletConnect
  • pd_zkverify_infra_incidents_high - ZKVerify

Note: These are stored in Ansible Vault and should be kept secure.

Filecoin Repository Note: The Filecoin project uses multiple separate Ansible repositories (infrastructure-general/ansible/filecoin-execution, infra-ansible, and fil-ansible-collection) due to historical infrastructure drift. All repositories must use the same PagerDuty integration key (pd_fil_infra_incidents_key_high) and set project_name: "filecoin" in Prometheus targets to ensure consistent alert routing. This multi-repository setup is temporary until infrastructure consolidation is complete.
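
For example, a Prometheus target entry in any of these repositories would carry the same label (a sketch following the format shown in Step 2 under Adding New Alerts; the job name, host, and port are placeholders):

- labels:
    project_name: "filecoin"      # must match the pd-fil-infra-incidents-high route
    job: my_filecoin_service      # placeholder job name
  targets:
    - "fil-node-1:9100"           # placeholder host:port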

Alert Severity

Currently, all production alerts route to "high priority" PagerDuty integrations. The Alertmanager configuration includes TODOs to:

  • Add low priority alert receivers
  • Refine receivers with alert severity labels

When adding new alerts, consider these severity levels:

  • critical: Immediate action required, service is down or severely degraded
  • warning: Action required soon, service may be impacted
  • info: Informational, no immediate action required
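
Once severity labels are in place, the routing refinement mentioned in the TODOs could look something like the following sketch (the low-priority receiver and its integration key do not exist yet; all names here are hypothetical):

routes:
  - receiver: "pd-myproject-infra-incidents-high"
    matchers:
      - project_name=~"my-project"
      - severity=~"critical"

  - receiver: "pd-myproject-infra-incidents-low"   # hypothetical low-priority receiver
    matchers:
      - project_name=~"my-project"
      - severity=~"warning|info"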

Managing Alerts

Viewing Active Alerts

  1. Alertmanager UI: https://alertmanager.chainsafe.dev

    • View all active alerts
    • See alert grouping and routing
    • Silence alerts temporarily
  2. Grafana Cloud: https://chainsafe.grafana.net/alerting

    • View Grafana-managed alerts
    • See alert history and evaluation status
  3. Prometheus: https://prometheus.chainsafe.dev/alerts

    • View all alert rules and their current state

Silencing Alerts

Using Alertmanager UI

  1. Navigate to https://alertmanager.chainsafe.dev
  2. Click "New Silence"
  3. Configure matchers (e.g., alertname=InstanceDown, instance=server-1)
  4. Set duration
  5. Add comment explaining why the alert is silenced
  6. Click "Create"

Using amtool (CLI)
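
# amtool targets the Alertmanager given by --alertmanager.url or its config file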

# Silence an alert
amtool silence add alertname=InstanceDown instance=server-1 --duration=1h --comment="Scheduled maintenance"

# List active silences
amtool silence query

# Expire a silence
amtool silence expire <silence-id>

Updating Alert Rules

Alert rules are defined in Prometheus configuration files:

  1. Self-hosted Prometheus:

    • Location: infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml
    • Update the template and run the Ansible playbook (the rules-file layout is sketched after this list)
  2. Grafana Cloud Alerts:

    • Location: infrastructure-general/terraform/grafana-cloud/
    • Update Terraform files and apply changes
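
For reference, prom_rules.yml follows the standard Prometheus rules-file layout; a hedged sketch (the group name, expression, and threshold are illustrative, not the actual production values):

groups:
  - name: node-alerts                # group name is illustrative
    rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
          project_name: "filecoin"   # determines Alertmanager routing
        annotations:
          summary: "Host has less than 10% memory available"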

On-Call Procedures

Receiving a Page

When you receive a PagerDuty page:

  1. Acknowledge the incident in PagerDuty
  2. Review the alert details:
    • Check the alert name and description
    • Review the runbook URL (if provided in alert annotations)
    • Check the Prometheus query URL to see current metrics
  3. Assess the severity:
    • Is the service actually down?
    • How many users/services are affected?
    • Is this a false positive?
  4. Take action:
    • Follow the runbook for the specific alert
    • Check logs in Grafana Cloud
    • Review metrics dashboards
    • Escalate if needed

Escalation Path

  1. Primary On-Call: First responder (you)
  2. Secondary On-Call: If primary is unavailable
  3. Team Lead: For critical issues requiring additional resources
  4. External Support: Cloud provider support, vendor support, etc.

Post-Incident

After resolving an incident:

  1. Resolve the incident in PagerDuty
  2. Document the incident:
    • What happened?
    • Root cause?
    • Resolution steps?
    • Prevention measures?
  3. Update runbooks if procedures changed
  4. Review alert thresholds if it was a false positive or missed alert

Common Alerts & Runbooks

Infrastructure Alerts

These alerts are documented in the General Infrastructure Runbook:

  • InstanceDown: Server is not operational
  • HostOutOfMemory: Host is running out of memory
  • HostDiskWillFillIn24Hours: Disk space is nearly full
  • HostOomKillDetected: Out-of-memory kill detected
  • HostRequiresReboot: System requires a reboot
  • PrometheusRuleEvaluationFailures: Prometheus rule evaluation errors

Project-Specific Alerts

Each project has its own runbook; see the project-specific documentation listed under Resources.

Adding New Alerts

Step 1: Define the Alert Rule

Add the alert to the Prometheus rules file:

- alert: MyNewAlert
  expr: my_metric > threshold
  for: 5m
  labels:
    severity: critical
    project_name: "my-project"
  annotations:
    summary: "Alert summary"
    description: "Detailed description"
    runbook: "https://infra-docs.chainsafe.dev/docs/..."

Step 2: Ensure Proper Labeling

Make sure your metrics include the project_name label so alerts route correctly:

- labels:
    scrape_location: my_service
    job: my_service
    instance: "server-1"
    project_name: "my-project"  # This determines routing
  targets:
    - "server-1:8080"

Step 3: Add Alertmanager Route (if needed)

If routing to a new PagerDuty integration, add a route in Alertmanager:

routes:
  - receiver: "pd-myproject-infra-incidents-high"
    matchers:
      - project_name=~"my-project"

receivers:
  - name: "pd-myproject-infra-incidents-high"
    pagerduty_configs:
      - routing_key: "{{ pd_myproject_infra_incidents_key_high }}"
        send_resolved: true

Step 4: Create/Update Runbook

Document the alert resolution steps in the appropriate runbook.

Step 5: Test the Alert

  1. Trigger the alert condition (safely)
  2. Verify it routes to the correct PagerDuty integration
  3. Confirm the alert details are clear
  4. Test the resolution steps
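
Rule changes can also be unit-tested offline with promtool before they are rolled out. A minimal sketch, assuming the Step 1 rule with its placeholder threshold replaced by a concrete value such as 100, and a test file named my_alert_tests.yml next to prom_rules.yml (both the threshold and the file name are illustrative); run it with promtool test rules my_alert_tests.yml:

rule_files:
  - prom_rules.yml                  # rules file under test, relative to this file
evaluation_interval: 1m
tests:
  - interval: 1m
    # feed ~10 minutes of samples where my_metric stays above the threshold
    input_series:
      - series: 'my_metric{instance="server-1", job="my_service", project_name="my-project"}'
        values: '150x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: MyNewAlert
        exp_alerts:
          - exp_labels:
              severity: critical
              project_name: "my-project"
              instance: server-1
              job: my_service
            exp_annotations:
              summary: "Alert summary"
              description: "Detailed description"
              runbook: "https://infra-docs.chainsafe.dev/docs/..."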

Best Practices

Alert Design

  • Be specific: Alert names and descriptions should clearly indicate what's wrong
  • Include context: Add labels and annotations that help diagnose the issue
  • Link to runbooks: Always include a runbook URL in alert annotations
  • Set appropriate thresholds: Avoid alert fatigue from false positives
  • Use an appropriate "for" duration: Don't alert on transient issues

On-Call Practices

  • Respond promptly: Acknowledge pages within SLA (typically 15 minutes)
  • Communicate: Update PagerDuty notes with your investigation progress
  • Document: Add notes about what you tried and what worked
  • Escalate early: Don't struggle alone if you're stuck
  • Follow runbooks: They exist for a reason

Alert Maintenance

  • Review regularly: Check for stale or unused alerts
  • Update thresholds: Adjust based on actual behavior
  • Remove false positives: Don't let noisy alerts desensitize the team
  • Add missing alerts: If incidents happen without alerts, add them

Troubleshooting

Alerts Not Firing

  1. Check if the alert rule is syntactically correct
  2. Verify the PromQL expression returns data
  3. Check whether the alert's "for" duration has elapsed
  4. Review Prometheus logs for rule evaluation errors

Alerts Not Routing to PagerDuty

  1. Verify the project_name label matches a route matcher
  2. Check Alertmanager logs for routing errors
  3. Verify PagerDuty integration key is correct
  4. Test PagerDuty integration connectivity

Too Many Alerts

  1. Review alert thresholds - may be too sensitive
  2. Check for duplicate alerts
  3. Consider grouping related alerts
  4. Review and silence known issues appropriately

Resources

  • Alertmanager UI: https://alertmanager.chainsafe.dev
  • Prometheus UI: https://prometheus.chainsafe.dev
  • Grafana Cloud: https://chainsafe.grafana.net
  • PagerDuty: https://chainsafe.pagerduty.com
  • Runbooks: See project-specific documentation