Monitoring Components Inventory

This document inventories the monitoring components, exporters, and tools deployed across the infrastructure.

Core Monitoring Stack

Metrics Collection

Component	Purpose	Location	Documentation
Prometheus	Metrics collection and storage	Self-hosted (`prometheus.chainsafe.dev`)	Metrics Guide
Node Exporter	Host/OS metrics (CPU, memory, disk, network)	Deployed on all servers	Node Exporter Role
Grafana Alloy	Metrics and logs collection agent	Deployed on servers	Grafana Alloy Role
Grafana Agent	Legacy metrics/logs collector (EOL Nov 2025)	Some servers	Grafana Agent Role

Log Collection

Component	Purpose	Location	Documentation
Loki	Log aggregation and storage	Self-hosted (`loki.chainsafe.dev`)	Logging Guide
Promtail	Log shipper to Loki	Deployed on all servers	Promtail Role
Grafana Alloy	Also collects logs (replacing Promtail)	Newer deployments	Grafana Alloy Role

Alerting

Component	Purpose	Location	Documentation
Alertmanager	Alert routing and notification	Self-hosted (`alertmanager.chainsafe.dev`)	Alerting Guide
PagerDuty	On-call incident management	Cloud service	Alerting Guide
Slack	Non-critical alert notifications	Cloud service	Alerting Guide

Visualization

Component	Purpose	Location	Documentation
Grafana Cloud (US)	Dashboards and visualization	`chainsafe.grafana.net`	Monitoring Overview
Grafana Cloud (EU)	Dashboards and visualization	`chainsafe02.grafana.net`	Monitoring Overview

Additional Observability

Component	Purpose	Location	Documentation
Pyroscope	Continuous profiling	Self-hosted	Profiling Guide
Grafana Tempo	Distributed tracing	Grafana Cloud	Tracing Guide
Mimir	Long-term metrics storage (optional)	Self-hosted	Mimir Role

Application-Specific Exporters & Monitors

Blockchain Node Monitors

Component	Purpose	Projects	Role/Implementation
Filecoin Bootnode Monitor	Monitors Filecoin bootnode connectivity	Filecoin	`filecoin_bootnode_monitor` role
Nebula Monitor	Monitors Nebula network metrics	Filecoin	`nebula` role (includes monitor)
Validator Watcher	Monitors Ethereum validator status	Ethereum/Lodestar	`validator_watcher` role
Wallet Balance Alert	Monitors wallet balances	Various	`wallet_balance_alert` role

Service-Specific Exporters

Component	Purpose	Projects	Role/Implementation
Cloudflare Exporter	Exports Cloudflare metrics	Infrastructure	`cloudflare_exporter_docker_role`
Polkadot Node Metrics	Native Polkadot metrics	Polkadot	Built into `polkadot_node_docker_role`
Lotus Metrics	Filecoin Lotus node metrics	Filecoin	Built into `lotus_fullnode_docker_role`
Forest Metrics	Filecoin Forest node metrics	Filecoin	Built into `forest_fullnode_docker_role`
Beacon Node Metrics	Ethereum beacon node metrics	Ethereum/Lodestar	Built into `beacon` role
Execution Node Metrics	Ethereum execution node metrics	Ethereum/Lodestar	Built into `execution` role

Kubernetes Monitoring

EKS Cluster Monitoring

Component	Purpose	Location	Configuration
k8s-monitoring Helm Chart	Kubernetes metrics collection	EKS clusters	`infrastructure-general/terraform/k8s/eks/argocd-apps/charts/management/observability/`
Grafana Alloy (K8s)	Metrics and logs from Kubernetes	EKS clusters	Via k8s-monitoring chart
Cluster Metrics	Kubernetes cluster metrics	EKS clusters	Enabled in observability values
Pod Logs	Container log collection	EKS clusters	Enabled in observability values
Cluster Events	Kubernetes event collection	EKS clusters	Enabled in observability values

Grafana Cloud Integrations

Synthetic Monitoring

Component	Purpose	Projects	Configuration
Filecoin Snapshot Check	Monitors Filecoin snapshot service	Filecoin	`infrastructure-general/terraform/grafana-cloud/filecoin.tf`
Forest Staging Alerts	Grafana Cloud alerts for Forest staging	Filecoin	`infrastructure-general/terraform/grafana-cloud/forest-staging.tf`

Alert Rules in Grafana Cloud

Some projects have alert rules managed directly in Grafana Cloud via Terraform:

- Filecoin snapshot monitoring
- Forest staging environment alerts

Monitoring Ports

Common ports used for monitoring:

Port	Service	Purpose
`9100`	Node Exporter	Host metrics (internal)
`39100`	Node Exporter	Host metrics (public)
`9090`	Prometheus	Prometheus UI (internal)
`9093`	Alertmanager	Alertmanager UI (internal)
`39093`	Alertmanager	Alertmanager UI (public)
`3100`	Loki	Loki API (internal)
`9615`	Polkadot	Polkadot metrics (internal)
`39615`	Polkadot	Polkadot metrics (public)
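
Every internal/public pair listed above differs by 30000 (`9100`/`39100`, `9093`/`39093`, `9615`/`39615`). The sketch below encodes the table and builds `/metrics` URLs from it. The offset is inferred from the listed pairs rather than stated as a rule, and the `http` scheme and the hostname in the usage note are assumptions for illustration only.

```python
# Ports copied from the table above.
INTERNAL_PORTS = {
    "node_exporter": 9100,
    "prometheus": 9090,
    "alertmanager": 9093,
    "loki": 3100,
    "polkadot": 9615,
}
PUBLIC_PORTS = {
    "node_exporter": 39100,
    "alertmanager": 39093,
    "polkadot": 39615,
}

# Assumption: inferred from the listed pairs, not a documented convention.
PUBLIC_OFFSET = 30000


def metrics_url(host, service, public=False):
    """Build the /metrics URL for a service on a host (plain http is assumed)."""
    ports = PUBLIC_PORTS if public else INTERNAL_PORTS
    if service not in ports:
        kind = "public" if public else "internal"
        raise KeyError(f"no {kind} port listed for {service}")
    return f"http://{host}:{ports[service]}/metrics"


def offset_holds():
    """Check that every listed public port equals the internal port plus the offset."""
    return all(
        PUBLIC_PORTS[name] - INTERNAL_PORTS[name] == PUBLIC_OFFSET
        for name in PUBLIC_PORTS
    )
```

For example, `metrics_url("node1.example.internal", "polkadot", public=True)` yields a URL on port `39615`; the hostname is hypothetical.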

Monitoring Data Flow

Metrics Flow

Application/Node → Node Exporter/App Metrics → Prometheus → Alertmanager → PagerDuty
                                                      ↓
                                              Grafana Cloud (dashboards)

Logs Flow

Application Containers → Promtail/Grafana Alloy → Loki → Grafana Cloud
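
Promtail and Grafana Alloy handle shipping for you, but it can help to see what reaches Loki. The sketch below builds the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`). The label names and the local URL in the usage note are assumptions, and whether `loki.chainsafe.dev` accepts direct pushes or requires authentication is not covered by this document.

```python
import json
import time
import urllib.request


def build_loki_push_payload(labels, lines, now_ns=None):
    """Return the JSON body for Loki's push API as a Python dict."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    # Adding the index keeps timestamps distinct and ordered within the batch.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def push(loki_base_url, payload):
    """POST the payload to Loki. The base URL is supplied by the caller."""
    request = urllib.request.Request(
        f"{loki_base_url}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A local test might call `push("http://localhost:3100", build_loki_push_payload({"job": "demo"}, ["hello"]))`, assuming a Loki instance is listening on the internal port listed above.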

Kubernetes Flow

K8s Pods → Grafana Alloy → infra-prometheus/infra-loki → Grafana Cloud

Configuration Files

Alertmanager Configuration

- Ansible Collection: `ansible-collection/roles/alertmanager/templates/alertmanager.yml`
- Infrastructure General: `infrastructure-general/ansible/general/playbooks/templates/alertmanager_rules.yml`

Prometheus Configuration

- Ansible Collection: `ansible-collection/roles/prometheus/templates/prometheus.yml`
- Infrastructure General: `infrastructure-general/ansible/general/playbooks/templates/prometheus_config.yml`

Alert Rules

- Ansible Collection: `ansible-collection/roles/prometheus/templates/rules.yml`
- Infrastructure General: `infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml`

Grafana Cloud Configuration

- Terraform: `infrastructure-general/terraform/grafana-cloud/`
- Filecoin Snapshot Alerts: Active Grafana Cloud alert rules for Filecoin snapshot monitoring
  - Location: `infrastructure-general/terraform/grafana-cloud/filecoin.tf`
  - Alerts: `FilecoinSnapshotAgeOld`, `FilecoinOrphanArchiveFile`, `FilecoinSnapshotNoUpload`
  - Routes to: PagerDuty `pd-fil-infra-incidents-high` integration

Repository Structure

Infrastructure Repositories

The infrastructure is managed across multiple repositories:

- `infrastructure-general`: Primary consolidated repository containing most infrastructure projects
- `ansible-collection`: Reusable Ansible roles and collections
- `infra-ansible`: Legacy Filecoin deployment repository (see note below)
- `fil-ansible-collection`: Legacy Filecoin deployment repository (see note below)
- `infra-kubernetes`: Kubernetes infrastructure configurations
- `lodestar-ansible-development`: Lodestar development environment and DevOps support
- `lodestar-ansible-production`: Lodestar production environment, including the Lido fleet (managed from the prod folder)

⚠️ Filecoin Repository Note:
The Filecoin project is currently deployed from three separate repositories because of historical infrastructure drift:

- Primary: `infrastructure-general/ansible/filecoin-execution`
- Legacy: `infra-ansible`
- Legacy: `fil-ansible-collection`

All three route alerts to the same PagerDuty integration (`pd-fil-infra-incidents-high`) by setting `project_name: "filecoin"` on their Prometheus targets. Consolidation back into a single source of truth is planned.
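
The label-based routing described in this note can be modelled in a few lines. This is a simplified sketch of first-match routing on alert labels, not the actual Alertmanager configuration (which lives in the `alertmanager.yml` template listed under Configuration Files). Only `pd-fil-infra-incidents-high` comes from this document; the default receiver name and the `source_repo` label are hypothetical.

```python
# Ordered list of (label matchers, destination). First match wins.
ROUTES = [
    ({"project_name": "filecoin"}, "pd-fil-infra-incidents-high"),
]

# Hypothetical fallback for alerts that match no project route.
DEFAULT_RECEIVER = "default-receiver"


def pick_receiver(alert_labels):
    """Return the destination for an alert based on its labels."""
    for matchers, receiver in ROUTES:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


# Alerts that originate from targets defined in different repositories end up
# in the same place because they carry the same project_name label.
alerts = [
    {"alertname": "NodeDown", "project_name": "filecoin", "source_repo": "infrastructure-general"},
    {"alertname": "NodeDown", "project_name": "filecoin", "source_repo": "infra-ansible"},
    {"alertname": "NodeDown", "project_name": "filecoin", "source_repo": "fil-ansible-collection"},
]
destinations = {pick_receiver(alert) for alert in alerts}
```

The point of the design is that routing depends only on the label, so the repository a target was deployed from does not affect who gets paged.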

Adding New Monitoring

For a New Service

1. Add a metrics endpoint to your service.
2. Add a Prometheus target using the `update_targets.yml` task of the `prometheus` role.
3. Set the required labels, including `project_name`, which is used for alert routing.
4. Add alert rules if needed.
5. Update the runbook with alert resolution steps.
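
As an illustration of the first step, here is a minimal metrics endpoint using only the Python standard library. It serves one counter in the Prometheus text exposition format. The metric name `demo_requests_total` and port `8000` are hypothetical, and a real service would more likely use an official Prometheus client library for its language than hand-rendered text.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counter; a real service would increment this as it handles work.
REQUESTS_TOTAL = 0


def render_metrics(requests_total):
    """Render a single counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Total requests handled by the demo service.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {requests_total}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint and block forever. The port is an arbitrary example."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then requesting `http://localhost:8000/metrics` returns the counter, which Prometheus can scrape once the target has been added.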

For a New Project

1. Set up a PagerDuty integration key.
2. Add an Alertmanager route for the new project.
3. Configure project labels in the Prometheus targets.
4. Create a project runbook in the documentation.
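
To show what "project labels in Prometheus targets" can look like, the sketch below generates a target group in Prometheus's file-based service discovery format (a JSON list of objects with `targets` and `labels`). This is an assumption about shape only: the `prometheus` role's `update_targets.yml` task may store targets differently. The hostnames and the project name `newproject` are hypothetical.

```python
import json


def target_group(hosts, port, project_name, extra_labels=None):
    """Build one file_sd target group carrying the project_name label."""
    labels = {"project_name": project_name}
    labels.update(extra_labels or {})
    return {"targets": [f"{host}:{port}" for host in hosts], "labels": labels}


def render_file_sd(groups):
    """Serialize target groups as the JSON document a file_sd config would read."""
    return json.dumps(groups, indent=2, sort_keys=True)


# Hypothetical project with two hosts exposing Node Exporter on its internal port.
example_group = target_group(
    ["node1.example.internal", "node2.example.internal"],
    9100,
    "newproject",
)
example_json = render_file_sd([example_group])
```

Every target in the group carries `project_name`, which is the label the Alertmanager route for the new project would match on.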

Maintenance Tasks

Regular Reviews

- Review and update alert thresholds quarterly
- Audit unused or stale alerts monthly
- Review PagerDuty integration keys annually
- Update runbooks after incidents
- Review monitoring coverage for new services

Health Checks

- Verify Prometheus is scraping all targets
- Check Alertmanager is routing alerts correctly
- Verify PagerDuty integrations are working
- Check Grafana Cloud dashboards are accessible
- Review log retention and storage usage
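
The first check above can be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports the health of each active target. The sketch below is one possible approach, not an existing tool in these repositories. The base URL is supplied by the caller, and whether `prometheus.chainsafe.dev` requires authentication is not covered here.

```python
import json
import urllib.request


def fetch_targets(base_url):
    """Fetch the targets payload from a Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (scrapeUrl, health, lastError) for active targets that are not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl"), target.get("health"), target.get("lastError", ""))
        for target in active
        if target.get("health") != "up"
    ]


# Hand-written sample in the shape of the API response, for illustration only.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://a.example.internal:9100/metrics", "health": "up", "lastError": ""},
            {"scrapeUrl": "http://b.example.internal:9100/metrics", "health": "down", "lastError": "connection refused"},
        ],
        "droppedTargets": [],
    },
}
problems = unhealthy_targets(sample_payload)
```

Against a live server, `unhealthy_targets(fetch_targets("http://localhost:9090"))` would list every target that Prometheus currently fails to scrape, assuming the API is reachable on the internal port.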

Related Documentation

- Monitoring Overview: High-level architecture
- Alerting & On-Call Guide: Alert management procedures
- Metrics Guide: How to send metrics
- Logging Guide: How to send logs
- Tracing Guide: How to send traces
- Profiling Guide: How to set up profiling
