Filecoin Runbook
Snapshot Service Down
📝 Description:
We have two servers dedicated to the Filecoin mainnet and calibnet snapshot services:
- fil-prod-ovh-forest-snapshot-0:
  - Uploads snapshots continuously, once every hour.
- fil-prod-ovh-forest-snapshot-1:
  - Serves as a backup if the primary fails.
  - Handles on-demand snapshot generation.
🔧 Action Steps:
- Enable the backup service:
  - Set the cronjob_enabled variable to true in the filecoin-execution hosts.ini file 📄 here.
- Deploy to the backup host:
  make deploy-fil-snapshot HOSTS=fil-prod-ovh-forest-snapshot-1
- Verify the cron job:
  - Log into the server and confirm the cron job is set correctly (-l lists the entries without opening an editor):
    sudo crontab -l
- Notify the project driver for further investigation.
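The cron verification step can be scripted so that a commented-out entry is not mistaken for an active one. This is a minimal sketch: `cron_has_active_job` is a hypothetical helper, and the search string `snapshot` is an assumption about what the deployed cron entry contains, so substitute whatever identifies the real job.

```shell
# cron_has_active_job PATTERN
# Reads a crontab listing on stdin and succeeds (exit 0) when at least one
# non-comment line contains PATTERN as a plain substring.
# PATTERN is an assumption: use the string that identifies the real job.
cron_has_active_job() {
    awk -v pattern="$1" '
        /^[ \t]*#/ { next }                  # skip commented-out entries
        index($0, pattern) { found = 1 }     # plain substring match
        END { exit !found }
    '
}

# On the backup host (not executed here):
#   sudo crontab -l | cron_has_active_job snapshot && echo "cron job present"
```

If the check fails after a deploy, the cronjob_enabled change most likely did not reach the host, which is worth mentioning when notifying the project driver.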
Filecoin Snapshot Age Old
📝 Description:
A snapshot service issue has been detected on one of the networks (mainnet or calibnet): no snapshot has been uploaded in the last 4 hours.
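To make the alert condition concrete, the 4-hour rule can be expressed as a small age check. This is only a sketch: `snapshot_is_stale` is a hypothetical helper, and how the last upload time is obtained (bucket listing, metrics, logs) is not specified in this runbook.

```shell
# snapshot_is_stale LAST_UPLOAD_UNIX_SECONDS NOW_UNIX_SECONDS [MAX_AGE_HOURS]
# Succeeds (exit 0) when the snapshot is older than MAX_AGE_HOURS.
# The default of 4 hours matches the alert described above.
snapshot_is_stale() {
    last_upload=$1
    now=$2
    max_age_hours=${3:-4}
    age_seconds=$(( now - last_upload ))
    [ "$age_seconds" -gt $(( max_age_hours * 3600 )) ]
}

# Example with fixed timestamps: 18000 seconds (5 hours) between the upload
# and "now", so this prints "stale".
if snapshot_is_stale 1700000000 1700018000; then
    echo "stale"
else
    echo "fresh"
fi
```

In practice the second argument would be the current time, for example `$(date +%s)`.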
🔧 Action Steps:
- Go to the GitHub Workflow here.
- Keep the branch set to main and select the desired network.
- Monitor the process using Loki logs at chainsafe.grafana.net with the following filters:
  instance=fil-prod-ovh-forest-snapshot-1
  job=snapshot-service
- Notify the project driver for further investigation.
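In Grafana Explore, the two Loki filters combine into a single LogQL stream selector. The sketch below only builds that selector string; `loki_selector` is a hypothetical helper, while the label names and values are the ones listed in the steps above.

```shell
# loki_selector INSTANCE JOB
# Prints a LogQL stream selector that matches both labels exactly.
loki_selector() {
    printf '{instance="%s", job="%s"}\n' "$1" "$2"
}

# Prints: {instance="fil-prod-ovh-forest-snapshot-1", job="snapshot-service"}
loki_selector fil-prod-ovh-forest-snapshot-1 snapshot-service
```

Paste the printed selector into the query field. A line filter such as `|= "error"` can be appended to narrow the output, although the useful filter text depends on what the snapshot service actually logs.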
Filecoin Lotus Syncing Fail
📝 Description:
A Lotus node syncing issue has been detected, indicating the node is behind by at least 10 epochs.
🔧 Action Steps:
- Monitor the situation closely:
  - It might be a temporary network lag.
  - Use the Lotus dashboard on Grafana for insights into node performance.
- Verify Lotus version:
  - Ensure we are running the latest Lotus node version by checking the dashboard or running:
    lotus info
- Restart the Lotus Node:
  - Re-run the appropriate Ansible playbook to address syncing issues.
- Escalate if unresolved:
  - Notify the team for further investigation.
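When deciding whether the lag is temporary, it helps to put a number on it. Filecoin epochs are 30 seconds long, so the 10-epoch threshold corresponds to roughly 5 minutes of chain time. The sketch below compares the node's head height with a reference height. Both helper names are hypothetical, and the source of the reference height (a second node, a block explorer) is an assumption left to the operator.

```shell
# epochs_behind LOCAL_HEIGHT REFERENCE_HEIGHT
# Prints how many epochs the local head trails the reference (never negative).
epochs_behind() {
    lag=$(( $2 - $1 ))
    if [ "$lag" -lt 0 ]; then
        lag=0
    fi
    echo "$lag"
}

# is_sync_lagging LOCAL_HEIGHT REFERENCE_HEIGHT [THRESHOLD_EPOCHS]
# Succeeds (exit 0) when the lag reaches the threshold (default 10, as in the alert).
is_sync_lagging() {
    [ "$(epochs_behind "$1" "$2")" -ge "${3:-10}" ]
}

# Example with made-up heights: prints 12, i.e. about 12 * 30 = 360 seconds
# of chain time behind the reference.
epochs_behind 4000000 4000012
```

On the node itself, `lotus sync status` reports the state of the sync workers and `lotus sync wait` blocks until the node has caught up, which is a convenient way to confirm recovery after a restart.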
Forest Tipsets Validated Per Minute
📝 Description:
A low tipset validation rate has been detected on a Forest Node. This may indicate syncing issues or performance degradation.
🔧 Action Steps:
- Monitor performance:
  - It could be a temporary network lag.
  - Use the Forest dashboard on Grafana for insights into performance.
- Check Forest node version:
  - Ensure we're running the latest version by executing:
    forest-cli info show
- Restart the Forest Node:
  - Restart the container to address tipset validation issues.
- Escalate if unresolved:
  - Notify the team for further investigation.
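To judge whether the validation rate is really low, it can be recomputed by hand from two samples of a validated-tipset counter. This is a sketch under assumptions: `tipsets_per_minute` is a hypothetical helper, and the counter values would come from whichever metric backs the Grafana panel, which this runbook does not name.

```shell
# tipsets_per_minute COUNT_START COUNT_END ELAPSED_SECONDS
# Prints the integer validation rate between two counter samples.
tipsets_per_minute() {
    if [ "$3" -le 0 ]; then
        echo "elapsed seconds must be positive" >&2
        return 1
    fi
    echo $(( ($2 - $1) * 60 / $3 ))
}

# Example with made-up samples: 10 tipsets validated over 300 seconds.
# Prints 2.
tipsets_per_minute 1000 1010 300
```

With 30-second epochs, a node that is already at the chain head validates roughly 2 tipsets per minute, while a node that is catching up should show a much higher rate. A sustained value well below that on a synced node supports restarting the container.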
Lotus Peer Connections
📝 Description:
A low number of peer connections has been detected on the Lotus node, indicating potential network isolation or connectivity issues.
🔧 Action Steps:
- Check connectivity:
  - Ensure there are no network issues with the Filecoin node.
- Verify node versions:
  - Confirm the latest versions are running by executing:
    forest-cli info show
    or
    lotus info
- Restart the Filecoin Node:
  - Use Ansible playbooks to resolve peer connection issues.
- Escalate if unresolved:
  - Notify the team for deeper investigation.
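The connectivity check can be backed by an actual peer count taken on the node. This is a minimal sketch, assuming the peer listing prints one peer per line (as `lotus net peers` does). `peer_count` and `has_enough_peers` are hypothetical helpers, and the minimum of 5 in the usage comment is a placeholder rather than the alert's real threshold.

```shell
# peer_count
# Reads a peer listing on stdin (one peer per line) and prints the number of
# non-empty lines.
peer_count() {
    awk 'NF { n++ } END { print n + 0 }'
}

# has_enough_peers COUNT MIN_PEERS
# Succeeds (exit 0) when COUNT is at least MIN_PEERS.
has_enough_peers() {
    [ "$1" -ge "$2" ]
}

# On the node (not executed here; 5 is a placeholder threshold):
#   count=$(lotus net peers | peer_count)
#   has_enough_peers "$count" 5 || echo "low peer count: $count"
```

Comparing the count before and after the restart shows whether the playbook run actually restored connectivity.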
🔄 Continuous Improvement
This runbook provides a structured approach to addressing Filecoin node alerts and issues.
✅ Periodically review and update this document to incorporate new insights, tools, and evolving procedures.