Monitoring & Alerting Overview

This document gives an overview of the monitoring and alerting infrastructure at ChainSafe. It covers the architecture, components, alert routing, alert flow, and key configuration files. Detailed on-call procedures are in the Alerting & On-Call Guide.

Architecture

Our monitoring stack consists of multiple layers:

┌────────────────────────────────────────────────────────┐
│                    Production Services                 │
│   (Filecoin, Polkadot, Ethereum, IPFS, Canton, etc.)   │
├────────────────────────────────────────────────────────┤
│                    Metrics Collection                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Node Exporter│  │ Promtail     │  │ Grafana Alloy│  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
├────────────────────────────────────────────────────────┤
│                  Metrics & Logs Storage                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Prometheus   │  │ Loki         │  │ Grafana Cloud│  │
│  │ (self-hosted)│  │ (self-hosted)│  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
├────────────────────────────────────────────────────────┤
│                        Alerting                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Prometheus   │  │ Alertmanager │  │ PagerDuty    │  │
│  │ Alert Rules  │  │              │  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
├────────────────────────────────────────────────────────┤
│                    Visualization                       │
│  ┌──────────────┐  ┌──────────────┐                    │
│  │ Grafana Cloud│  │ Grafana Cloud│                    │
│  │ (Dashboards) │  │ (Alerts)     │                    │
│  └──────────────┘  └──────────────┘                    │
└────────────────────────────────────────────────────────┘

Components

Self-Hosted Infrastructure

Prometheus

Location: Self-hosted on infrastructure metrics server
Purpose: Metrics collection and storage
Retention: 30 days
URL: https://prometheus.chainsafe.dev
Documentation: See Metrics Documentation
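
As a rough illustration of reading data from this instance, the sketch below builds an instant-query URL for the standard Prometheus HTTP API (`/api/v1/query`) and extracts label/value pairs from a response body. The `up{project_name="filecoin"}` query and the sample response are illustrative assumptions, and any authentication the endpoint may require is not handled here.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "https://prometheus.chainsafe.dev"  # URL listed above


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


# Hypothetical response body, shaped like an instant-vector result.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "project_name": "filecoin"},
         "value": [1700000000, "1"]},
    ]},
})

if __name__ == "__main__":
    print(instant_query_url('up{project_name="filecoin"}'))
    print(parse_vector(SAMPLE_BODY))
```

The URL can be fetched with any HTTP client; parsing is kept separate so it can be checked against a canned response without network access.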

Loki

Location: Self-hosted on infrastructure metrics server
Purpose: Log aggregation and storage
Retention: 30 days
URL: https://loki.chainsafe.dev
Documentation: See Logging Documentation
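
A comparable sketch for Loki, using its standard `query_range` HTTP endpoint. It assumes log streams carry a `project_name` label like the Prometheus targets do, which this document does not confirm; the sample response is also hypothetical.

```python
import json
from urllib.parse import urlencode

LOKI_URL = "https://loki.chainsafe.dev"  # URL listed above


def query_range_url(logql: str, start_ns: int, end_ns: int, limit: int = 100,
                    base_url: str = LOKI_URL) -> str:
    """Build a query_range URL; start and end are Unix times in nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


def parse_streams(body: str) -> list[tuple[int, str]]:
    """Flatten a 'streams' response into (timestamp_ns, line) pairs, oldest first."""
    payload = json.loads(body)
    lines = []
    for stream in payload["data"]["result"]:
        for ts_ns, line in stream["values"]:
            lines.append((int(ts_ns), line))
    return sorted(lines)


# Hypothetical response body; Loki returns timestamps as nanosecond strings.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {"resultType": "streams", "result": [
        {"stream": {"project_name": "filecoin"},  # assumed label
         "values": [["1700000001000000000", "second line"],
                    ["1700000000000000000", "first line"]]},
    ]},
})

if __name__ == "__main__":
    print(query_range_url('{project_name="filecoin"}', 1, 2, limit=10))
    print(parse_streams(SAMPLE_BODY))
```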

Alertmanager

Location: Self-hosted on infrastructure metrics server
Purpose: Alert routing and notification management
URL: https://alertmanager.chainsafe.dev
Configuration: Routes alerts to PagerDuty based on project labels

Grafana Cloud

Grafana Cloud Stack

Primary Stack: chainsafe.grafana.net (US)
Secondary Stack: chainsafe02.grafana.net (EU)
Purpose: Visualization, dashboards, and additional alerting
Access: GitHub OAuth authentication
Documentation: See individual observability docs

Alert Routing

Alertmanager routes each alert according to the project_name label set on the Prometheus target that produced it:

Project Name	Routing Destination	Integration Type	Notes
`filecoin`	PagerDuty - Filecoin High Priority	PagerDuty Integration Key	Routes alerts from `infrastructure-general`, `infra-ansible`, and `fil-ansible-collection` repositories
`polkadot`	PagerDuty - Polkadot High Priority	PagerDuty Integration Key
`ethereum`	PagerDuty - Ethereum High Priority	PagerDuty Integration Key
`ipfs`	PagerDuty - IPFS High Priority	PagerDuty Integration Key
`canton`	PagerDuty - Canton High Priority	PagerDuty Integration Key
`wallet-connect`	PagerDuty - WalletConnect High Priority	PagerDuty Integration Key
`zkverify`	PagerDuty - ZKVerify High Priority	PagerDuty Integration Key
`celestia`	Alert routing needs to be configured	To be added	Currently using `project_name: "celestia"` but no Alertmanager route exists
`aztec`	Alert routing needs to be configured	To be added	Currently using `project_name: "aztec"` but no Alertmanager route exists
`forest-staging`	Slack - forest-infra-staging channel	Slack Webhook
`misc`, `monad`	Slack - misc-infra-incidents channel	Slack Webhook

Note on Filecoin Infrastructure: The Filecoin project is currently deployed from multiple repositories (infrastructure-general, infra-ansible, and fil-ansible-collection) due to historical infrastructure drift. All repositories use project_name: "filecoin" in their Prometheus targets, ensuring all Filecoin alerts route to the same PagerDuty integration regardless of source repository. Plans are in place to consolidate back to a single source of truth.
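
The routing table can be modeled as a simple lookup, which is handy for checking whether a given label value has a destination. This is only a sketch of the table as documented here, not the Alertmanager configuration itself. The function and variable names are hypothetical, and what Alertmanager actually does with an unmatched alert depends on the root route's receiver, which this document does not describe.

```python
from typing import Optional

# Destinations copied from the routing table above.
ROUTES = {
    "filecoin": "PagerDuty - Filecoin High Priority",
    "polkadot": "PagerDuty - Polkadot High Priority",
    "ethereum": "PagerDuty - Ethereum High Priority",
    "ipfs": "PagerDuty - IPFS High Priority",
    "canton": "PagerDuty - Canton High Priority",
    "wallet-connect": "PagerDuty - WalletConnect High Priority",
    "zkverify": "PagerDuty - ZKVerify High Priority",
    "forest-staging": "Slack - forest-infra-staging channel",
    "misc": "Slack - misc-infra-incidents channel",
    "monad": "Slack - misc-infra-incidents channel",
}

# Projects that set project_name but have no Alertmanager route yet.
UNROUTED = {"celestia", "aztec"}


def destination(labels: dict) -> Optional[str]:
    """Return the documented destination for an alert's labels, or None."""
    return ROUTES.get(labels.get("project_name", ""))


def has_routing_gap(labels: dict) -> bool:
    """True when the project is known but its route is still to be added."""
    return labels.get("project_name") in UNROUTED


if __name__ == "__main__":
    for project in ("filecoin", "monad", "celestia"):
        labels = {"project_name": project}
        print(project, destination(labels), has_routing_gap(labels))
```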

Alert Severity Levels

Currently, all production alerts route to "high priority" PagerDuty integrations. There are TODOs in the Alertmanager configuration to:

- Add low priority alert receivers
- Refine receivers with alert severity labels

Alert Flow

Prometheus/Alertmanager Flow

1. Prometheus evaluates alert rules based on metrics
2. Alertmanager receives alerts and routes them based on labels
3. PagerDuty receives critical alerts and pages on-call engineers
4. Slack receives staging/non-critical alerts for visibility
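
One way to exercise this flow end to end is to hand Alertmanager a synthetic alert through its v2 API (`POST /api/v2/alerts`) and watch where it lands. The sketch below only builds the request body. The alert name, annotation text, and helper name are hypothetical. Using a production project_name would page the on-call engineer, so the example uses forest-staging, which routes to Slack.

```python
import json
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "https://alertmanager.chainsafe.dev"  # URL from the component list
ALERTS_ENDPOINT = f"{ALERTMANAGER_URL}/api/v2/alerts"


def build_test_alert(project_name, alertname="RoutingSmokeTest", minutes=5, now=None):
    """Build a one-alert payload for POST /api/v2/alerts.

    alertname and the summary text are hypothetical; only the project_name
    label matters for the routing described above.
    """
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "project_name": project_name},
        "annotations": {"summary": "Synthetic alert for checking routing"},
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


if __name__ == "__main__":
    payload = build_test_alert("forest-staging")
    print("POST", ALERTS_ENDPOINT)
    print(json.dumps(payload, indent=2))
```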

Grafana Cloud Alert Flow

1. Grafana Cloud evaluates alert rules based on metrics from Grafana Cloud Prometheus
2. Grafana Alerting routes alerts directly to PagerDuty contact points
3. PagerDuty receives critical alerts and pages on-call engineers

Note: Filecoin snapshot alerts are managed via Grafana Cloud (configured in infrastructure-general/terraform/grafana-cloud/filecoin.tf) and route to the same PagerDuty integration as Prometheus alerts, ensuring consistent alerting coverage regardless of alert source.

Key Files

Alertmanager Configuration

Ansible Collection: ansible-collection/roles/alertmanager/templates/alertmanager.yml
Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/alertmanager_rules.yml

Prometheus Alert Rules

Ansible Collection: ansible-collection/roles/prometheus/templates/rules.yml
Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml

Grafana Cloud Alerts

Terraform: infrastructure-general/terraform/grafana-cloud/
Filecoin Snapshot Alerts: Active Grafana Cloud alerts monitor Filecoin snapshot services deployed from both infrastructure-general and infra-ansible repositories:
- FilecoinSnapshotAgeOld - Alerts when latest snapshot is older than 120 minutes
- FilecoinOrphanArchiveFile - Alerts when orphan files are detected in snapshot archive
- FilecoinSnapshotNoUpload - Alerts when no snapshots have been uploaded in 2 hours
- These alerts route to the same PagerDuty integration (pd-fil-infra-incidents-high) as Prometheus alerts
- Configuration: infrastructure-general/terraform/grafana-cloud/filecoin.tf
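
The 120-minute threshold behind FilecoinSnapshotAgeOld amounts to a simple age comparison, sketched below. The real rule lives in the Terraform file named above and is presumably written as a Grafana query expression. The function name and the strict "older than" comparison are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_SNAPSHOT_AGE = timedelta(minutes=120)  # threshold stated above


def snapshot_is_stale(latest_snapshot_at: datetime, now: datetime,
                      max_age: timedelta = MAX_SNAPSHOT_AGE) -> bool:
    """True when the latest snapshot is strictly older than max_age."""
    return now - latest_snapshot_at > max_age


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(snapshot_is_stale(now - timedelta(minutes=121), now))  # past the threshold
    print(snapshot_is_stale(now - timedelta(minutes=90), now))   # within the threshold
```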

Related Documentation

Logging - How to send logs to Loki
Metrics - How to send metrics to Prometheus
Tracing - How to send traces to Grafana Cloud
Profiling - How to set up continuous profiling
Alerting & On-Call Guide - Detailed alerting procedures and on-call runbook
