Monitoring & Alerting Overview

This document provides a comprehensive overview of the monitoring and alerting infrastructure at ChainSafe. It covers the architecture, components, alert routing, and on-call procedures.

Architecture

Our monitoring stack consists of multiple layers:

┌────────────────────────────────────────────────────────┐
│                  Production Services                   │
│   (Filecoin, Polkadot, Ethereum, IPFS, Canton, etc.)   │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                   Metrics Collection                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Node Exporter│  │   Promtail   │  │ Grafana Alloy│  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                 Metrics & Logs Storage                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  Prometheus  │  │     Loki     │  │ Grafana Cloud│  │
│  │ (self-hosted)│  │ (self-hosted)│  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                        Alerting                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  Prometheus  │  │ Alertmanager │  │  PagerDuty   │  │
│  │ Alert Rules  │  │              │  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                     Visualization                      │
│           ┌──────────────┐  ┌──────────────┐           │
│           │ Grafana Cloud│  │ Grafana Cloud│           │
│           │ (Dashboards) │  │   (Alerts)   │           │
│           └──────────────┘  └──────────────┘           │
└────────────────────────────────────────────────────────┘

Components

Self-Hosted Infrastructure

Prometheus

  • Location: Self-hosted on infrastructure metrics server
  • Purpose: Metrics collection and storage
  • Retention: 30 days
  • URL: https://prometheus.chainsafe.dev
  • Documentation: See Metrics Documentation

Loki

  • Location: Self-hosted on infrastructure metrics server
  • Purpose: Log aggregation and storage
  • Retention: 30 days
  • URL: https://loki.chainsafe.dev
  • Documentation: See Logging Documentation

Alertmanager

  • Location: Self-hosted on infrastructure metrics server
  • Purpose: Alert routing and notification management
  • URL: https://alertmanager.chainsafe.dev
  • Configuration: Routes alerts to PagerDuty or Slack based on the project_name label

Grafana Cloud

Grafana Cloud Stack

  • Primary Stack: chainsafe.grafana.net (US)
  • Secondary Stack: chainsafe02.grafana.net (EU)
  • Purpose: Visualization, dashboards, and additional alerting
  • Access: GitHub OAuth authentication
  • Documentation: See individual observability docs

Alert Routing

Alerts are routed based on the project_name label set on Prometheus targets:

Project Name   | Routing Destination                     | Integration Type          | Notes
---------------|-----------------------------------------|---------------------------|------
filecoin       | PagerDuty - Filecoin High Priority      | PagerDuty Integration Key | Routes alerts from infrastructure-general, infra-ansible, and fil-ansible-collection repositories
polkadot       | PagerDuty - Polkadot High Priority      | PagerDuty Integration Key |
ethereum       | PagerDuty - Ethereum High Priority      | PagerDuty Integration Key |
ipfs           | PagerDuty - IPFS High Priority          | PagerDuty Integration Key |
canton         | PagerDuty - Canton High Priority        | PagerDuty Integration Key |
wallet-connect | PagerDuty - WalletConnect High Priority | PagerDuty Integration Key |
zkverify       | PagerDuty - ZKVerify High Priority      | PagerDuty Integration Key |
celestia       | Alert routing needs to be configured    | To be added               | Currently using project_name: "celestia" but no Alertmanager route exists
aztec          | Alert routing needs to be configured    | To be added               | Currently using project_name: "aztec" but no Alertmanager route exists
forest-staging | Slack - forest-infra-staging channel    | Slack Webhook             |
misc, monad    | Slack - misc-infra-incidents channel    | Slack Webhook             |

Note on Filecoin Infrastructure: The Filecoin project is currently deployed from multiple repositories (infrastructure-general, infra-ansible, and fil-ansible-collection) due to historical infrastructure drift. All repositories use project_name: "filecoin" in their Prometheus targets, ensuring all Filecoin alerts route to the same PagerDuty integration regardless of source repository. Plans are in place to consolidate back to a single source of truth.
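
As an illustration only, the routing table can be modelled as a lookup keyed on the project_name label. This is a minimal Python sketch, not the Alertmanager configuration itself: the names ROUTES and route_alert and the "pagerduty"/"slack" tags are hypothetical, and returning None for celestia and aztec simply reflects that no project-specific route is documented for them (what Alertmanager actually does with such alerts depends on its default receiver, which this page does not describe).

```python
from typing import Optional, Tuple

# Transcribed from the routing table: project_name -> (destination, integration).
# The "pagerduty"/"slack" tags are illustrative shorthand, not real config keys.
ROUTES = {
    "filecoin": ("PagerDuty - Filecoin High Priority", "pagerduty"),
    "polkadot": ("PagerDuty - Polkadot High Priority", "pagerduty"),
    "ethereum": ("PagerDuty - Ethereum High Priority", "pagerduty"),
    "ipfs": ("PagerDuty - IPFS High Priority", "pagerduty"),
    "canton": ("PagerDuty - Canton High Priority", "pagerduty"),
    "wallet-connect": ("PagerDuty - WalletConnect High Priority", "pagerduty"),
    "zkverify": ("PagerDuty - ZKVerify High Priority", "pagerduty"),
    "forest-staging": ("Slack - forest-infra-staging channel", "slack"),
    "misc": ("Slack - misc-infra-incidents channel", "slack"),
    "monad": ("Slack - misc-infra-incidents channel", "slack"),
}


def route_alert(labels: dict) -> Optional[Tuple[str, str]]:
    """Return (destination, integration) for an alert's labels.

    Returns None when the project has no documented route
    (currently celestia and aztec) or the label is missing.
    """
    return ROUTES.get(labels.get("project_name"))
```

Because every Filecoin repository sets project_name: "filecoin", an alert from any of them resolves to the same destination in this model, which mirrors the behaviour described in the note above.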

Alert Severity Levels

Currently, all production alerts route to "high priority" PagerDuty integrations. There are TODOs in the Alertmanager configuration to:

  • Add low priority alert receivers
  • Refine receivers with alert severity labels
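
One possible shape for these TODOs, sketched in Python purely for discussion: the receiver names ("<project>-high", "<project>-low") and the severity value "critical" are assumptions, not names taken from the current Alertmanager configuration.

```python
def current_receiver(project: str, severity: str) -> str:
    """Today's documented behaviour: severity is ignored, everything is high priority."""
    return f"{project}-high"


def proposed_receiver(project: str, severity: str) -> str:
    """Hypothetical refinement: only 'critical' alerts use the high-priority receiver."""
    priority = "high" if severity == "critical" else "low"
    return f"{project}-{priority}"
```

Under this sketch, a warning-level Filecoin alert would move from the high-priority receiver to a low-priority one, while critical alerts would keep paging as they do now.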

Alert Flow

Prometheus/Alertmanager Flow

  1. Prometheus evaluates alert rules based on metrics
  2. Alertmanager receives alerts and routes them based on labels
  3. PagerDuty receives critical alerts and pages on-call engineers
  4. Slack receives staging/non-critical alerts for visibility

Grafana Cloud Alert Flow

  1. Grafana Cloud evaluates alert rules based on metrics from Grafana Cloud Prometheus
  2. Grafana Alerting routes alerts directly to PagerDuty contact points
  3. PagerDuty receives critical alerts and pages on-call engineers

Note: Filecoin snapshot alerts are managed via Grafana Cloud (configured in infrastructure-general/terraform/grafana-cloud/filecoin.tf) and route to the same PagerDuty integration as Prometheus alerts, ensuring consistent alerting coverage regardless of alert source.

Key Files

Alertmanager Configuration

  • Ansible Collection: ansible-collection/roles/alertmanager/templates/alertmanager.yml
  • Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/alertmanager_rules.yml

Prometheus Alert Rules

  • Ansible Collection: ansible-collection/roles/prometheus/templates/rules.yml
  • Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml

Grafana Cloud Alerts

  • Terraform: infrastructure-general/terraform/grafana-cloud/
  • Filecoin Snapshot Alerts: Active Grafana Cloud alerts monitor Filecoin snapshot services deployed from both infrastructure-general and infra-ansible repositories:
    • FilecoinSnapshotAgeOld - Alerts when latest snapshot is older than 120 minutes
    • FilecoinOrphanArchiveFile - Alerts when orphan files are detected in snapshot archive
    • FilecoinSnapshotNoUpload - Alerts when no snapshots have been uploaded in 2 hours
    • These alerts route to the same PagerDuty integration (pd-fil-infra-incidents-high) as Prometheus alerts
    • Configuration: infrastructure-general/terraform/grafana-cloud/filecoin.tf
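
The two time-based snapshot alerts reduce to simple threshold checks. The Python sketch below only illustrates those conditions using the thresholds stated above; the real alerts are Grafana Cloud rules defined in Terraform, and the function names and input shapes here are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable

SNAPSHOT_MAX_AGE = timedelta(minutes=120)  # FilecoinSnapshotAgeOld threshold
UPLOAD_WINDOW = timedelta(hours=2)         # FilecoinSnapshotNoUpload window


def snapshot_age_old(latest_snapshot: datetime, now: datetime) -> bool:
    """True when the latest snapshot is strictly older than 120 minutes."""
    return now - latest_snapshot > SNAPSHOT_MAX_AGE


def no_recent_upload(upload_times: Iterable[datetime], now: datetime) -> bool:
    """True when no upload falls within the last 2 hours."""
    return not any(now - t <= UPLOAD_WINDOW for t in upload_times)
```

Passing `now` explicitly instead of reading the clock inside the functions keeps the checks deterministic and easy to test.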