Filecoin Project

Background

The Infrastructure Team is responsible for running several critical infrastructure components of Filecoin Mainnet and Calibnet for Protocol Labs.
These services are network-critical and require high availability and responsive monitoring to ensure seamless performance. ⚙️📈


List of Services


Filecoin Infrastructure Inventory

Access the complete infrastructure inventory here. 📊


Deployment & Upgrade Steps

We use Ansible to deploy services in a containerized environment. Whether it's an initial deployment or an upgrade, the process is the same.

🔄 Deployment Type: Recreate

  1. Get the Latest Image Tag

  2. Update the Image Tag
    Update the respective host configurations in:

⚠️ Important Note on Repository Structure:
The Filecoin project is currently deployed from multiple separate repositories due to historical infrastructure drift:

  • infrastructure-general/ansible/filecoin-execution - Primary, consolidated repository
  • infra-ansible - Legacy repository
  • fil-ansible-collection - Legacy repository

Current Status: All repositories are actively used for Filecoin deployments. Both legacy repositories route alerts to the same PagerDuty integration (pd-fil-infra-incidents-high) as the primary repository, ensuring consistent alerting coverage.

Future Plans: The legacy repositories are planned to be merged into the primary infrastructure-general repository to establish a single source of truth and eliminate infrastructure drift. Until that migration is complete, all three repositories must be maintained.

  3. Dry Run the Ansible Command
    Use the --diff and --check flags to preview the changes before applying them. 🛠️

  4. Apply the Changes
    Re-run the command without --check to deploy the changes. 🚀

  5. Verify Deployment
    Ensure the service is up and running by performing post-deployment checks. ✅

  6. Raise a PR
    Submit a Pull Request and request team approval.
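
The dry-run and apply steps above can be sketched as a small helper that builds the two ansible-playbook invocations. This is a minimal illustration, not the team's actual tooling: the playbook name, inventory path, and host name are hypothetical placeholders, and the real values depend on which repository the host is deployed from. Only the standard ansible-playbook options -i, --limit, --check, and --diff are used.

```python
import shlex
from typing import List


def build_ansible_command(playbook: str, inventory: str, limit: str,
                          dry_run: bool) -> List[str]:
    """Build an ansible-playbook argument list.

    With dry_run=True the command previews changes (--check --diff);
    with dry_run=False the same command applies them.
    """
    cmd = ["ansible-playbook", "-i", inventory, playbook, "--limit", limit]
    if dry_run:
        cmd += ["--check", "--diff"]
    return cmd


if __name__ == "__main__":
    # Hypothetical names, used for illustration only.
    args = dict(playbook="deploy.yml",
                inventory="inventory/hosts.yml",
                limit="example-host")
    # Preview first, then apply with the identical command minus the flags.
    print(shlex.join(build_ansible_command(dry_run=True, **args)))
    print(shlex.join(build_ansible_command(dry_run=False, **args)))
```

Building both commands from the same arguments keeps the dry run and the real run identical apart from the preview flags, so what was reviewed is what gets applied.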


Monitoring & Alerting

Alert Sources

Filecoin infrastructure is monitored through two alerting systems:

  1. Prometheus/Alertmanager Alerts (self-hosted)

    • Node health, sync status, peer connectivity
    • Host metrics (CPU, memory, disk)
    • Routes via project_name: "filecoin" label to PagerDuty
  2. Grafana Cloud Alerts (managed via Terraform)

    • Filecoin snapshot service monitoring
    • Configured in: infrastructure-general/terraform/grafana-cloud/filecoin.tf
    • Active alerts:
      • FilecoinSnapshotAgeOld - Snapshot older than 120 minutes
      • FilecoinOrphanArchiveFile - Orphan files in snapshot archive
      • FilecoinSnapshotNoUpload - No snapshots uploaded in 2 hours
    • Routes to same PagerDuty integration (pd-fil-infra-incidents-high)

Both alerting systems route to the same PagerDuty integration, ensuring consistent on-call coverage.
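
As a rough illustration of the FilecoinSnapshotAgeOld condition listed above, the sketch below checks whether a snapshot is older than 120 minutes. The real rule lives in the Grafana Cloud Terraform configuration; the function name, the strict "greater than" comparison, and the way the last upload time is obtained are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the alert description (120 minutes).
SNAPSHOT_MAX_AGE = timedelta(minutes=120)


def snapshot_is_stale(last_upload: datetime, now: datetime,
                      max_age: timedelta = SNAPSHOT_MAX_AGE) -> bool:
    """Return True when the latest snapshot is older than max_age.

    Assumption: the comparison is strict, so a snapshot that is exactly
    120 minutes old does not yet count as stale.
    """
    return now - last_upload > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical upload time, 3 hours ago.
    last_upload = now - timedelta(hours=3)
    print("stale" if snapshot_is_stale(last_upload, now) else "fresh")
```

A check like this can be handy during post-deployment verification or when triaging a page, but it does not replace the managed alert.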

Runbook for Troubleshooting

We actively track actionable alerts, each accompanied by detailed steps for resolution. 🚨

📚 Check out the Filecoin Runbook here.