Filecoin Runbook


Snapshot Service Down

📝 Description:
We have two servers dedicated to the Filecoin mainnet and calibnet snapshot services:

  • fil-prod-ovh-forest-snapshot-0:
    • Uploads snapshots continuously, once every hour.
  • fil-prod-ovh-forest-snapshot-1:
    • Functions as a backup if the primary fails.
    • Handles on-demand snapshot generation.

🔧 Action Steps:

  1. Enable the backup service:
    • Set the cronjob_enabled variable to true in the filecoin-execution hosts.ini file 📄 here.
  2. Deploy to the backup host:
    make deploy-fil-snapshot HOSTS=fil-prod-ovh-forest-snapshot-1
  3. Verify the cron job:
    Log into the server and list the installed cron jobs to confirm the entry is present:
    sudo crontab -l
  4. Notify the project driver for further investigation.
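
The cron check in step 3 can also be done without reading the listing by eye. Below is a minimal sketch, assuming the deployed cron entry contains a recognizable string; the helper name has_cron_entry and the pattern "snapshot" are hypothetical and not taken from the deployment itself.

```shell
#!/bin/sh
# Hypothetical helper: reads a crontab listing on stdin and succeeds only if
# an active (non-comment) entry contains the given fixed string.
has_cron_entry() {
    pattern="$1"
    # Drop commented-out lines first, then look for the pattern.
    grep -v '^[[:space:]]*#' | grep -qF -- "$pattern"
}

# Example usage on the backup host (the pattern "snapshot" is an assumption):
#   sudo crontab -l | has_cron_entry snapshot && echo "snapshot cron job present"
```

A commented-out entry is treated as missing, which is the case to watch for after a deploy that did not pick up cronjob_enabled.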

Filecoin Snapshot Age Old

📝 Description:
A snapshot service issue has been detected on one of the networks (mainnet or calibnet): no snapshot has been uploaded in the last 4 hours.

🔧 Action Steps:

  1. Go to the GitHub Workflow here.
  2. Keep the branch set to main, select the affected network, and run the workflow.
  3. Monitor the process using Loki logs at chainsafe.grafana.net with the following filters:
    • instance=fil-prod-ovh-forest-snapshot-1
    • job=snapshot-service
  4. Notify the project driver for further investigation.
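
The two filters in step 3 correspond to a single LogQL stream selector. Below is a minimal sketch that builds that selector. The helper name loki_selector is hypothetical, and the commented logcli call assumes logcli is installed and already configured (address and credentials) for the Grafana stack, which this runbook does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: prints a LogQL stream selector for an instance and job.
loki_selector() {
    printf '{instance="%s", job="%s"}\n' "$1" "$2"
}

# The selector can be pasted into the Grafana Explore query field, or used
# with logcli (assumed to be installed and configured):
#   logcli query --since=4h --limit=200 \
#       "$(loki_selector fil-prod-ovh-forest-snapshot-1 snapshot-service)"
```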

Filecoin Lotus Syncing Fail

📝 Description:
A Lotus node syncing issue has been detected, indicating the node is behind by at least 10 epochs.

🔧 Action Steps:

  1. Monitor the situation closely:
    • It might be a temporary network lag.
    • Use the Lotus dashboard on Grafana for insights into node performance.
  2. Verify Lotus version:
    • Ensure we are running the latest Lotus node version by checking the dashboard or running:
      lotus info
  3. Restart the Lotus Node:
    • Re-run the appropriate Ansible playbook to address syncing issues.
  4. Escalate if unresolved:
    • Notify the team for further investigation.
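
To judge whether the node is really 10 or more epochs behind, the expected chain height can be estimated from wall-clock time. Below is a minimal sketch, assuming Filecoin mainnet (genesis at 2020-08-24 22:00:00 UTC, 30-second epochs); calibnet has a different genesis time, so GENESIS_TS would need changing there. The helper names are hypothetical, and the node height is supplied by the operator, for example from the Grafana dashboard or from `lotus sync status`.

```shell
#!/bin/sh
GENESIS_TS=1598306400   # mainnet genesis as a Unix timestamp (assumption: mainnet only)
EPOCH_SECONDS=30        # Filecoin block time

# Prints the epoch the chain should be at for a given Unix timestamp.
expected_epoch() {
    echo $(( ($1 - GENESIS_TS) / EPOCH_SECONDS ))
}

# Prints how many epochs a node at height $1 is behind at Unix timestamp $2.
epochs_behind() {
    echo $(( $(expected_epoch "$2") - $1 ))
}

# Example usage (NODE_HEIGHT is read by the operator, not computed here):
#   lag=$(epochs_behind "$NODE_HEIGHT" "$(date +%s)")
#   [ "$lag" -ge 10 ] && echo "node is $lag epochs behind"
```

A lag that shrinks between two checks suggests the node is catching up on its own; a constant or growing lag points toward the restart in step 3.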

Forest Tipsets Validated Per Minute

📝 Description:
A low tipset validation rate has been detected on a Forest Node. This may indicate syncing issues or performance degradation.

🔧 Action Steps:

  1. Monitor performance:
    • It could be a temporary network lag.
    • Use the Forest dashboard on Grafana for insights into performance.
  2. Check Forest node version:
    • Ensure we're running the latest version by executing:
      forest-cli info show
  3. Restart the Forest Node:
    • Restart the container to address tipset validation issues.
  4. Escalate if unresolved:
    • Notify the team for further investigation.
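
For the version check in step 2, the comparison itself can be scripted once the running and latest version strings are known. Below is a minimal sketch, assuming GNU sort (for the -V version ordering) and assuming the operator reads both version strings manually, for example from the `forest-cli info show` output and the release page. The helper name is_outdated is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 (running) is strictly older
# than version $2 (latest). Relies on GNU sort's -V version ordering.
is_outdated() {
    running="$1"
    latest="$2"
    [ "$running" != "$latest" ] &&
        [ "$(printf '%s\n%s\n' "$running" "$latest" | sort -V | head -n 1)" = "$running" ]
}

# Example usage (version numbers are placeholders):
#   is_outdated 0.19.2 0.20.0 && echo "upgrade before restarting"
```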

Lotus Peer Connections

📝 Description:
A low number of peer connections has been detected on the Lotus node, indicating potential network isolation or connectivity issues.

🔧 Action Steps:

  1. Check connectivity:
    • Ensure there are no network issues with the Filecoin node.
  2. Verify node versions:
    • Confirm the latest versions are running by executing:
      forest-cli info show
      or
      lotus info
  3. Restart the Filecoin Node:
    • Use Ansible playbooks to resolve peer connection issues.
  4. Escalate if unresolved:
    • Notify the team for deeper investigation.
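
Before restarting, it can help to confirm the peer count directly on the node. Below is a minimal sketch that counts peers from a one-peer-per-line listing such as the output of `lotus net peers`. The helper names and the threshold of 5 are hypothetical; the actual alert threshold is not stated in this runbook.

```shell
#!/bin/sh
# Hypothetical helper: counts non-empty lines on stdin (one peer per line).
count_peers() {
    grep -c . || true   # grep exits non-zero on a zero count; still print 0
}

# Hypothetical helper: succeeds if peer count $1 is below threshold $2.
peers_below_threshold() {
    [ "$1" -lt "$2" ]
}

# Example usage (the threshold 5 is an assumption):
#   n=$(lotus net peers | count_peers)
#   peers_below_threshold "$n" 5 && echo "only $n peers connected"
```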

🔄 Continuous Improvement

This runbook provides a structured approach to addressing Filecoin node alerts and issues.
✅ Periodically review and update this document to incorporate new insights, tools, and evolving procedures.