Monitoring Components Inventory

This document inventories the monitoring components, exporters, and tools deployed across the infrastructure.

Core Monitoring Stack

Metrics Collection

| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Prometheus | Metrics collection and storage | Self-hosted (prometheus.chainsafe.dev) | Metrics Guide |
| Node Exporter | Host/OS metrics (CPU, memory, disk, network) | Deployed on all servers | Node Exporter Role |
| Grafana Alloy | Metrics and logs collection agent | Deployed on servers | Grafana Alloy Role |
| Grafana Agent | Legacy metrics/logs collector (EOL Nov 2025) | Some servers | Grafana Agent Role |

Log Collection

| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Loki | Log aggregation and storage | Self-hosted (loki.chainsafe.dev) | Logging Guide |
| Promtail | Log shipper to Loki | Deployed on all servers | Promtail Role |
| Grafana Alloy | Also collects logs (replacing Promtail) | Newer deployments | Grafana Alloy Role |

Alerting

| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Alertmanager | Alert routing and notification | Self-hosted (alertmanager.chainsafe.dev) | Alerting Guide |
| PagerDuty | On-call incident management | Cloud service | Alerting Guide |
| Slack | Non-critical alert notifications | Cloud service | Alerting Guide |

Visualization

| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Grafana Cloud (US) | Dashboards and visualization | chainsafe.grafana.net | Monitoring Overview |
| Grafana Cloud (EU) | Dashboards and visualization | chainsafe02.grafana.net | Monitoring Overview |

Additional Observability

| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Pyroscope | Continuous profiling | Self-hosted | Profiling Guide |
| Grafana Tempo | Distributed tracing | Grafana Cloud | Tracing Guide |
| Mimir | Long-term metrics storage (optional) | Self-hosted | Mimir Role |

Application-Specific Exporters & Monitors

Blockchain Node Monitors

| Component | Purpose | Projects | Role/Implementation |
|---|---|---|---|
| Filecoin Bootnode Monitor | Monitors Filecoin bootnode connectivity | Filecoin | filecoin_bootnode_monitor role |
| Nebula Monitor | Monitors Nebula network metrics | Filecoin | nebula role (includes monitor) |
| Validator Watcher | Monitors Ethereum validator status | Ethereum/Lodestar | validator_watcher role |
| Wallet Balance Alert | Monitors wallet balances | Various | wallet_balance_alert role |

Service-Specific Exporters

| Component | Purpose | Projects | Role/Implementation |
|---|---|---|---|
| Cloudflare Exporter | Exports Cloudflare metrics | Infrastructure | cloudflare_exporter_docker_role |
| Polkadot Node Metrics | Native Polkadot metrics | Polkadot | Built into polkadot_node_docker_role |
| Lotus Metrics | Filecoin Lotus node metrics | Filecoin | Built into lotus_fullnode_docker_role |
| Forest Metrics | Filecoin Forest node metrics | Filecoin | Built into forest_fullnode_docker_role |
| Beacon Node Metrics | Ethereum beacon node metrics | Ethereum/Lodestar | Built into beacon role |
| Execution Node Metrics | Ethereum execution node metrics | Ethereum/Lodestar | Built into execution role |

Kubernetes Monitoring

EKS Cluster Monitoring

| Component | Purpose | Location | Configuration |
|---|---|---|---|
| k8s-monitoring Helm Chart | Kubernetes metrics collection | EKS clusters | infrastructure-general/terraform/k8s/eks/argocd-apps/charts/management/observability/ |
| Grafana Alloy (K8s) | Metrics and logs from Kubernetes | EKS clusters | Via k8s-monitoring chart |
| Cluster Metrics | Kubernetes cluster metrics | EKS clusters | Enabled in observability values |
| Pod Logs | Container log collection | EKS clusters | Enabled in observability values |
| Cluster Events | Kubernetes event collection | EKS clusters | Enabled in observability values |

Grafana Cloud Integrations

Synthetic Monitoring

| Component | Purpose | Projects | Configuration |
|---|---|---|---|
| Filecoin Snapshot Check | Monitors Filecoin snapshot service | Filecoin | infrastructure-general/terraform/grafana-cloud/filecoin.tf |
| Forest Staging Alerts | Grafana Cloud alerts for Forest staging | Filecoin | infrastructure-general/terraform/grafana-cloud/forest-staging.tf |

Alert Rules in Grafana Cloud

Some projects have alert rules managed directly in Grafana Cloud via Terraform:

  • Filecoin snapshot monitoring
  • Forest staging environment alerts

Monitoring Ports

Common ports used for monitoring:

| Port | Service | Purpose |
|---|---|---|
| 9100 | Node Exporter | Host metrics (internal) |
| 39100 | Node Exporter | Host metrics (public) |
| 9090 | Prometheus | Prometheus UI (internal) |
| 9093 | Alertmanager | Alertmanager UI (internal) |
| 39093 | Alertmanager | Alertmanager UI (public) |
| 3100 | Loki | Loki API (internal) |
| 9615 | Polkadot | Polkadot metrics (internal) |
| 39615 | Polkadot | Polkadot metrics (public) |
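
As a convenience, the port assignments above can be captured in a small lookup. The following is a minimal sketch, not part of any existing role: the `MONITORING_PORTS` name, the `endpoint` helper, and the `server-01.example` hostname are hypothetical, and only the port numbers come from the table.

```python
# Port assignments copied from the table above: service -> (internal, public).
# None means the table lists no public port for that service.
MONITORING_PORTS = {
    "Node Exporter": (9100, 39100),
    "Prometheus": (9090, None),
    "Alertmanager": (9093, 39093),
    "Loki": (3100, None),
    "Polkadot": (9615, 39615),
}


def endpoint(host: str, service: str, public: bool = False) -> str:
    """Return "host:port" for a service, using the internal port by default."""
    internal_port, public_port = MONITORING_PORTS[service]
    port = public_port if public else internal_port
    if port is None:
        raise ValueError(f"no public port listed for {service}")
    return f"{host}:{port}"


# "server-01.example" is a placeholder hostname, not a real server.
print(endpoint("server-01.example", "Node Exporter"))               # internal port
print(endpoint("server-01.example", "Node Exporter", public=True))  # public port
```

Asking for a public port that the table does not list (for example Prometheus or Loki) raises a `ValueError` rather than guessing one.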

Monitoring Data Flow

Metrics Flow

Application/Node → Node Exporter/App Metrics → Prometheus → Alertmanager → PagerDuty
Prometheus → Grafana Cloud (dashboards)

Logs Flow

Application Containers → Promtail/Grafana Alloy → Loki → Grafana Cloud

Kubernetes Flow

K8s Pods → Grafana Alloy → infra-prometheus/infra-loki → Grafana Cloud

Configuration Files

Alertmanager Configuration

  • Ansible Collection: ansible-collection/roles/alertmanager/templates/alertmanager.yml
  • Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/alertmanager_rules.yml

Prometheus Configuration

  • Ansible Collection: ansible-collection/roles/prometheus/templates/prometheus.yml
  • Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/prometheus_config.yml

Alert Rules

  • Ansible Collection: ansible-collection/roles/prometheus/templates/rules.yml
  • Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml

Grafana Cloud Configuration

  • Terraform: infrastructure-general/terraform/grafana-cloud/
  • Filecoin Snapshot Alerts: Active Grafana Cloud alert rules for Filecoin snapshot monitoring
    • Location: infrastructure-general/terraform/grafana-cloud/filecoin.tf
    • Alerts: FilecoinSnapshotAgeOld, FilecoinOrphanArchiveFile, FilecoinSnapshotNoUpload
    • Routes to: PagerDuty pd-fil-infra-incidents-high integration

Repository Structure

Infrastructure Repositories

The infrastructure is managed across multiple repositories:

  • infrastructure-general - Primary consolidated repository containing most infrastructure projects
  • ansible-collection - Reusable Ansible roles and collections
  • infra-ansible - Legacy Filecoin deployment repository (see note below)
  • fil-ansible-collection - Legacy Filecoin deployment repository (see note below)
  • infra-kubernetes - Kubernetes infrastructure configurations
  • lodestar-ansible-development - Lodestar development environment and DevOps support
  • lodestar-ansible-production - Lodestar production environment including Lido fleet (managed from prod folder)

⚠️ Filecoin Repository Note:
The Filecoin project is currently deployed from multiple separate repositories due to historical infrastructure drift:

  • Primary: infrastructure-general/ansible/filecoin-execution
  • Legacy: infra-ansible
  • Legacy: fil-ansible-collection

All three repositories route alerts to the same PagerDuty integration (pd-fil-infra-incidents-high) by setting project_name: "filecoin" on their Prometheus targets. The plan is to consolidate them into a single source of truth.
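
To illustrate why a shared label is enough to unify routing, here is a minimal sketch of label-based receiver selection. It is a simplified stand-in for what Alertmanager does, not the actual configuration: `ROUTES`, `DEFAULT_RECEIVER`, `receiver_for`, the `InstanceDown` alert name, and the `source` label are all hypothetical. Only the `project_name: "filecoin"` label and the `pd-fil-infra-incidents-high` integration name come from this document.

```python
# Route table keyed on the project_name label. Only the "filecoin" entry is
# taken from this document; DEFAULT_RECEIVER is a hypothetical fallback.
ROUTES = {"filecoin": "pd-fil-infra-incidents-high"}
DEFAULT_RECEIVER = "default-receiver"


def receiver_for(labels: dict) -> str:
    """Pick a receiver from an alert's labels, the way a label matcher would."""
    return ROUTES.get(labels.get("project_name"), DEFAULT_RECEIVER)


# Hypothetical alerts from targets defined in the three Filecoin repositories.
# The "source" label is illustrative only; routing never looks at it.
alerts = [
    {"alertname": "InstanceDown", "project_name": "filecoin", "source": "infrastructure-general"},
    {"alertname": "InstanceDown", "project_name": "filecoin", "source": "infra-ansible"},
    {"alertname": "InstanceDown", "project_name": "filecoin", "source": "fil-ansible-collection"},
]

# Every alert resolves to the same receiver because routing depends only on
# project_name, not on which repository defined the target.
receivers = {receiver_for(alert) for alert in alerts}
print(receivers)
```

An alert without a `project_name` label falls through to the fallback receiver in this sketch, which is one reason the label matters when adding targets.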

Adding New Monitoring

For a New Service

  1. Add a metrics endpoint to your service
  2. Add a Prometheus target using the prometheus role's update_targets.yml task
  3. Set the required labels on the target, including project_name, which is used for alert routing
  4. Add alert rules if needed
  5. Update the runbook with alert resolution steps
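
Steps 2 and 3 can be sketched as follows. The input format expected by update_targets.yml is not shown in this document, so this example assumes the target ends up in the JSON shape used by Prometheus file-based service discovery (a list of objects with `targets` and `labels`). The `build_target_group` helper, the hostname, and the `env` label are hypothetical.

```python
import json


def build_target_group(host, port, project_name, extra_labels=None):
    """Build one file_sd-style target group carrying the project_name label."""
    labels = dict(extra_labels or {})
    # Set project_name last so extra labels cannot silently override it.
    labels["project_name"] = project_name
    return {"targets": [f"{host}:{port}"], "labels": labels}


# Placeholder host and label values; 9100 is the Node Exporter internal port.
group = build_target_group(
    "my-service.example.internal", 9100, "filecoin", {"env": "staging"}
)

# file_sd files hold a list of target groups.
print(json.dumps([group], indent=2))
```

Setting `project_name` after merging the extra labels is a deliberate choice here: a missing or overwritten label would send alerts to the wrong route.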

For a New Project

  1. Set up PagerDuty integration key
  2. Add Alertmanager route for the new project
  3. Configure project labels in Prometheus targets
  4. Create project runbook in documentation
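
Steps 1 and 2 roughly correspond to one receiver and one route in the Alertmanager configuration. The sketch below builds both as plain dictionaries using standard Alertmanager field names (`matchers`, `receiver`, `pagerduty_configs`, `routing_key`). The alertmanager.yml template in the repositories may be organized differently, and the `project_alerting_config` helper, the `polkadot` project, the receiver name, and the key placeholder are hypothetical.

```python
def project_alerting_config(project_name, receiver_name, routing_key):
    """Return a (route, receiver) pair for one project, as plain dicts."""
    route = {
        "receiver": receiver_name,
        # Match alerts whose project_name label equals this project.
        "matchers": [f'project_name="{project_name}"'],
    }
    receiver = {
        "name": receiver_name,
        # routing_key is the PagerDuty integration key; keep it in a secret
        # store rather than in source control.
        "pagerduty_configs": [{"routing_key": routing_key}],
    }
    return route, receiver


# Placeholder values for a hypothetical new project.
route, receiver = project_alerting_config(
    "polkadot", "pd-polkadot-example", "<integration-key>"
)
print(route)
print(receiver)
```

The route would be added under `route.routes` and the receiver under `receivers` in the rendered configuration.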

Maintenance Tasks

Regular Reviews

  • Review and update alert thresholds quarterly
  • Audit unused or stale alerts monthly
  • Review PagerDuty integration keys annually
  • Update runbooks after incidents
  • Review monitoring coverage for new services

Health Checks

  • Verify Prometheus is scraping all targets
  • Check Alertmanager is routing alerts correctly
  • Verify PagerDuty integrations are working
  • Check Grafana Cloud dashboards are accessible
  • Review log retention and storage usage
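
The first health check can be partly automated with the Prometheus HTTP API, which lists scrape targets and their health at `/api/v1/targets`. The following is a minimal sketch: the `unhealthy_targets` and `fetch_targets` helpers are hypothetical, the sample payload is made up to show the response shape, and the base URL you pass in depends on how Prometheus is reachable from where you run it.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list:
    """Return (job, instance, lastError) for active targets not reporting 'up'."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("labels", {}).get("instance", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch the target list from a Prometheus server (performs a network call)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Made-up payload in the shape returned by /api/v1/targets.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "a:9100"}, "health": "up", "lastError": ""},
            {"labels": {"job": "node", "instance": "b:9100"}, "health": "down",
             "lastError": "connection refused"},
        ]
    },
}

for job, instance, error in unhealthy_targets(sample):
    print(f"{job} {instance}: {error}")
```

Against a live server, the same check would be `unhealthy_targets(fetch_targets(base_url))`, with `base_url` pointing at the Prometheus instance on port 9090.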