Monitoring Components Inventory
This document provides a comprehensive inventory of all monitoring components, exporters, and tools deployed across the infrastructure.
Core Monitoring Stack
Metrics Collection
| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Prometheus | Metrics collection and storage | Self-hosted (prometheus.chainsafe.dev) | Metrics Guide |
| Node Exporter | Host/OS metrics (CPU, memory, disk, network) | Deployed on all servers | Node Exporter Role |
| Grafana Alloy | Metrics and logs collection agent | Deployed on servers | Grafana Alloy Role |
| Grafana Agent | Legacy metrics/logs collector (EOL Nov 2025) | Some servers | Grafana Agent Role |
Log Collection
| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Loki | Log aggregation and storage | Self-hosted (loki.chainsafe.dev) | Logging Guide |
| Promtail | Log shipper to Loki | Deployed on all servers | Promtail Role |
| Grafana Alloy | Also collects logs (replacing Promtail) | Newer deployments | Grafana Alloy Role |
Alerting
| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Alertmanager | Alert routing and notification | Self-hosted (alertmanager.chainsafe.dev) | Alerting Guide |
| PagerDuty | On-call incident management | Cloud service | Alerting Guide |
| Slack | Non-critical alert notifications | Cloud service | Alerting Guide |
Visualization
| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Grafana Cloud (US) | Dashboards and visualization | chainsafe.grafana.net | Monitoring Overview |
| Grafana Cloud (EU) | Dashboards and visualization | chainsafe02.grafana.net | Monitoring Overview |
Additional Observability
| Component | Purpose | Location | Documentation |
|---|---|---|---|
| Pyroscope | Continuous profiling | Self-hosted | Profiling Guide |
| Grafana Tempo | Distributed tracing | Grafana Cloud | Tracing Guide |
| Mimir | Long-term metrics storage (optional) | Self-hosted | Mimir Role |
Application-Specific Exporters & Monitors
Blockchain Node Monitors
| Component | Purpose | Projects | Role/Implementation |
|---|---|---|---|
| Filecoin Bootnode Monitor | Monitors Filecoin bootnode connectivity | Filecoin | filecoin_bootnode_monitor role |
| Nebula Monitor | Monitors Nebula network metrics | Filecoin | nebula role (includes monitor) |
| Validator Watcher | Monitors Ethereum validator status | Ethereum/Lodestar | validator_watcher role |
| Wallet Balance Alert | Monitors wallet balances | Various | wallet_balance_alert role |
Service-Specific Exporters
| Component | Purpose | Projects | Role/Implementation |
|---|---|---|---|
| Cloudflare Exporter | Exports Cloudflare metrics | Infrastructure | cloudflare_exporter_docker_role |
| Polkadot Node Metrics | Native Polkadot metrics | Polkadot | Built into polkadot_node_docker_role |
| Lotus Metrics | Filecoin Lotus node metrics | Filecoin | Built into lotus_fullnode_docker_role |
| Forest Metrics | Filecoin Forest node metrics | Filecoin | Built into forest_fullnode_docker_role |
| Beacon Node Metrics | Ethereum beacon node metrics | Ethereum/Lodestar | Built into beacon role |
| Execution Node Metrics | Ethereum execution node metrics | Ethereum/Lodestar | Built into execution role |
Kubernetes Monitoring
EKS Cluster Monitoring
| Component | Purpose | Location | Configuration |
|---|---|---|---|
| k8s-monitoring Helm Chart | Kubernetes metrics collection | EKS clusters | infrastructure-general/terraform/k8s/eks/argocd-apps/charts/management/observability/ |
| Grafana Alloy (K8s) | Metrics and logs from Kubernetes | EKS clusters | Via k8s-monitoring chart |
| Cluster Metrics | Kubernetes cluster metrics | EKS clusters | Enabled in observability values |
| Pod Logs | Container log collection | EKS clusters | Enabled in observability values |
| Cluster Events | Kubernetes event collection | EKS clusters | Enabled in observability values |
Grafana Cloud Integrations
Synthetic Monitoring
| Component | Purpose | Projects | Configuration |
|---|---|---|---|
| Filecoin Snapshot Check | Monitors Filecoin snapshot service | Filecoin | infrastructure-general/terraform/grafana-cloud/filecoin.tf |
| Forest Staging Alerts | Grafana Cloud alerts for Forest staging | Filecoin | infrastructure-general/terraform/grafana-cloud/forest-staging.tf |
Alert Rules in Grafana Cloud
Some projects have alert rules managed directly in Grafana Cloud via Terraform:
- Filecoin snapshot monitoring
- Forest staging environment alerts
Monitoring Ports
Common ports used for monitoring:
| Port | Service | Purpose |
|---|---|---|
| 9100 | Node Exporter | Host metrics (internal) |
| 39100 | Node Exporter | Host metrics (public) |
| 9090 | Prometheus | Prometheus UI (internal) |
| 9093 | Alertmanager | Alertmanager UI (internal) |
| 39093 | Alertmanager | Alertmanager UI (public) |
| 3100 | Loki | Loki API (internal) |
| 9615 | Polkadot | Polkadot metrics (internal) |
| 39615 | Polkadot | Polkadot metrics (public) |
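In the table above, each public port equals the internal port plus 30000. The sketch below encodes that observation as a lookup. The offset rule is inferred from the listed values only and is an assumption, not a documented convention, and the service keys are hypothetical names chosen for illustration.

```python
# Internal monitoring ports, copied from the table above.
INTERNAL_PORTS = {
    "node_exporter": 9100,
    "prometheus": 9090,
    "alertmanager": 9093,
    "loki": 3100,
    "polkadot": 9615,
}

# Services for which the table also lists a public port.
PUBLICLY_EXPOSED = {"node_exporter", "alertmanager", "polkadot"}

# Assumption: public port = internal port + 30000 (inferred from the table).
PUBLIC_OFFSET = 30000


def public_port(service: str) -> int:
    """Return the public port for a service, or raise if none is listed."""
    if service not in PUBLICLY_EXPOSED:
        raise ValueError(f"{service} has no public port in the inventory")
    return INTERNAL_PORTS[service] + PUBLIC_OFFSET
```

A lookup like this can help when writing firewall rules or scrape targets, but the table remains the source of truth.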
Monitoring Data Flow
Metrics Flow
Application/Node → Node Exporter/App Metrics → Prometheus → Alertmanager → PagerDuty
Prometheus → Grafana Cloud (dashboards)
Logs Flow
Application Containers → Promtail/Grafana Alloy → Loki → Grafana Cloud
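To make the last hop of the logs flow concrete, the sketch below builds the JSON body that a shipper sends to Loki's push endpoint (`/loki/api/v1/push`). It only constructs the payload and sends nothing. The label names are hypothetical, and how Promtail or Grafana Alloy are actually configured here is not shown in this inventory.

```python
import json
import time
from typing import Dict, List, Optional


def build_loki_push(labels: Dict[str, str], lines: List[str],
                    ts_ns: Optional[int] = None) -> str:
    """Build a Loki push request body for one stream.

    Loki expects each entry as [timestamp in nanoseconds as a string, log line].
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    payload = {
        "streams": [
            {
                "stream": labels,  # label set identifying the stream
                "values": [[str(ts_ns), line] for line in lines],
            }
        ]
    }
    return json.dumps(payload)
```

In practice the agents handle batching, retries, and label discovery, so hand-built payloads like this are mainly useful for testing connectivity to Loki.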
Kubernetes Flow
K8s Pods → Grafana Alloy → infra-prometheus/infra-loki → Grafana Cloud
Configuration Files
Alertmanager Configuration
- Ansible Collection: ansible-collection/roles/alertmanager/templates/alertmanager.yml
- Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/alertmanager_rules.yml
Prometheus Configuration
- Ansible Collection: ansible-collection/roles/prometheus/templates/prometheus.yml
- Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/prometheus_config.yml
Alert Rules
- Ansible Collection: ansible-collection/roles/prometheus/templates/rules.yml
- Infrastructure General: infrastructure-general/ansible/general/playbooks/templates/prom_rules.yml
Grafana Cloud Configuration
- Terraform: infrastructure-general/terraform/grafana-cloud/
- Filecoin Snapshot Alerts: Active Grafana Cloud alert rules for Filecoin snapshot monitoring
  - Location: infrastructure-general/terraform/grafana-cloud/filecoin.tf
  - Alerts: FilecoinSnapshotAgeOld, FilecoinOrphanArchiveFile, FilecoinSnapshotNoUpload
  - Routes to: PagerDuty pd-fil-infra-incidents-high integration
Repository Structure
Infrastructure Repositories
The infrastructure is managed across multiple repositories:
- infrastructure-general - Primary consolidated repository containing most infrastructure projects
- ansible-collection - Reusable Ansible roles and collections
- infra-ansible - Legacy Filecoin deployment repository (see note below)
- fil-ansible-collection - Legacy Filecoin deployment repository (see note below)
- infra-kubernetes - Kubernetes infrastructure configurations
- lodestar-ansible-development - Lodestar development environment and DevOps support
- lodestar-ansible-production - Lodestar production environment including Lido fleet (managed from prod folder)
⚠️ Filecoin Repository Note:
The Filecoin project is currently deployed from multiple separate repositories due to historical infrastructure drift:
- Primary: infrastructure-general/ansible/filecoin-execution
- Legacy: infra-ansible
- Legacy: fil-ansible-collection

All repositories route alerts to the same PagerDuty integration (pd-fil-infra-incidents-high) by using project_name: "filecoin" in Prometheus targets. Plans are in place to consolidate back to a single source of truth in the future.
Adding New Monitoring
For a New Service
- Add metrics endpoint to your service
- Add a Prometheus target using the prometheus role's update_targets.yml task
- Ensure proper labels, including project_name for alert routing
- Add alert rules if needed
- Update runbook with alert resolution steps
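The target and label steps above can be sketched as a Prometheus target group carrying the project_name label. This assumes targets end up in Prometheus's file-based service discovery format (a list of `targets` plus `labels`). The input that the update_targets.yml task actually expects is not shown in this document, and the host name in the usage note is hypothetical.

```python
import json
from typing import Dict, List, Optional


def target_group(hosts: List[str], port: int, project_name: str,
                 extra_labels: Optional[Dict[str, str]] = None) -> dict:
    """Build one file_sd-style target group labelled for alert routing."""
    labels = {"project_name": project_name}
    labels.update(extra_labels or {})
    return {
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": labels,
    }


def to_file_sd_json(groups: List[dict]) -> str:
    """Serialize target groups as a file_sd JSON document."""
    return json.dumps(groups, indent=2)
```

For example, `target_group(["node1.example.org"], 9100, "filecoin")` describes a Node Exporter target whose alerts would carry `project_name="filecoin"`.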
For a New Project
- Set up PagerDuty integration key
- Add Alertmanager route for the new project
- Configure project labels in Prometheus targets
- Create project runbook in documentation
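Conceptually, the route added for a new project maps the project_name label on an alert to a receiver. The sketch below models that lookup in plain Python. It is not Alertmanager configuration; only the Filecoin receiver name comes from this document, and the default receiver name is hypothetical.

```python
from typing import Dict

# project_name label value -> receiver name.
# "pd-fil-infra-incidents-high" is taken from this document.
ROUTES: Dict[str, str] = {
    "filecoin": "pd-fil-infra-incidents-high",
}

# Hypothetical fallback for alerts without a matching project route.
DEFAULT_RECEIVER = "default-receiver"


def receiver_for(alert_labels: Dict[str, str]) -> str:
    """Pick the receiver for an alert based on its project_name label."""
    project = alert_labels.get("project_name", "")
    return ROUTES.get(project, DEFAULT_RECEIVER)
```

The model shows why a missing or misspelled project_name label matters: such alerts fall through to the default receiver instead of paging the project's on-call rotation.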
Maintenance Tasks
Regular Reviews
- Review and update alert thresholds quarterly
- Audit unused or stale alerts monthly
- Review PagerDuty integration keys annually
- Update runbooks after incidents
- Review monitoring coverage for new services
Health Checks
- Verify Prometheus is scraping all targets
- Check Alertmanager is routing alerts correctly
- Verify PagerDuty integrations are working
- Check Grafana Cloud dashboards are accessible
- Review log retention and storage usage
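The first health check above can be partly automated with the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports a `health` value for each active target. The following is a minimal sketch under the assumption that the API is reachable from where the script runs. The base URL is left as a parameter because the exact scheme and port of the deployment are not confirmed here.

```python
import json
from typing import List
from urllib.request import urlopen


def fetch_targets(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch the targets report from a Prometheus server."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_targets(targets_response: dict) -> List[dict]:
    """Return active targets whose health is anything other than 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "scrapeUrl": target.get("scrapeUrl"),
            "health": target.get("health"),
            "lastError": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]
```

Calling `unhealthy_targets(fetch_targets(base_url))` returns an empty list when every active target is being scraped successfully. Targets removed by relabeling appear under `droppedTargets` in the response and are not inspected here.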
Related Documentation
- Monitoring Overview - High-level architecture
- Alerting & On-Call Guide - Alert management procedures
- Metrics Guide - How to send metrics
- Logging Guide - How to send logs
- Tracing Guide - How to send traces
- Profiling Guide - How to set up profiling