General Runbook
This document outlines procedures and steps to address common alerts and issues.
HostDiskWillFillIn24Hours
Description: At the current rate of growth, a disk is predicted to fill within 24 hours.
Action:
- Confirm all disks are still attached to the server.
- Free up disk space, usually by clearing the execution layer (EL) chaindata and resyncing.
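As a rough aid for the checks above, the following Python sketch confirms that a path is a mount point, reports its usage, and makes a linear estimate of the time until the disk is full. The function names and the two-sample linear projection are illustrative assumptions and may differ from how the alert itself is calculated.

```python
import os
import shutil
from typing import Optional


def hours_until_full(free_then: int, free_now: int, elapsed_hours: float) -> Optional[float]:
    """Linear estimate of hours until free space reaches zero.

    Returns None when free space is not shrinking (no fill predicted).
    """
    consumed = free_then - free_now
    if consumed <= 0 or elapsed_hours <= 0:
        return None
    rate = consumed / elapsed_hours  # bytes consumed per hour
    return free_now / rate


def disk_report(path: str) -> str:
    """Summarise mount status and usage for one path."""
    usage = shutil.disk_usage(path)
    pct_used = 100 * usage.used / usage.total
    mounted = "mounted" if os.path.ismount(path) else "NOT a mount point"
    return f"{path}: {mounted}, {pct_used:.1f}% used, {usage.free / 2**30:.1f} GiB free"


if __name__ == "__main__":
    # "/" is only an example; substitute the data volume's mount point.
    print(disk_report("/"))
```

A path reported as "NOT a mount point" where a dedicated data volume is expected suggests the disk has detached and data is being written to the root filesystem instead.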
InstanceDown
Description: A server is not operational.
Action:
- Determine the deployment environment (AWS, Hetzner, Contabo, Netcup).
- Attempt to SSH into the server. If that fails, proceed to the next step.
- Attempt to restart the server.
- If the restart is unsuccessful, consider contacting the provider.
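Before attempting a full SSH login, a quick TCP probe can show whether the server is accepting connections at all. This is a minimal sketch: the helper name `tcp_port_open` is hypothetical, port 22 is assumed to be the SSH port, and a successful connect only shows that something is listening, not that login will succeed.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address (RFC 5737); replace it.
    host = "192.0.2.10"
    state = "reachable" if tcp_port_open(host, timeout=2.0) else "unreachable"
    print(f"{host}:22 is {state}")
```

If the port is unreachable, the fault is more likely at the host or network level, which is where a restart through the provider's console becomes the next step.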
HostOutOfMemory
Description: The host is running out of memory.
Action:
- Check for a BeaconNodeMemoryLeakDetected alert.
- Review other processes to ensure memory usage is within expected parameters.
- Restart processes and monitor to see if the issue persists.
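To review which processes are using the most memory, one option is to rank the output of `ps -eo pid,rss,comm` by resident set size. The sketch below assumes that command is available on the host; the helper names are hypothetical.

```python
import subprocess
from typing import List, Tuple

Row = Tuple[int, int, str]  # (pid, rss in KiB, command name)


def parse_ps(output: str) -> List[Row]:
    """Parse `ps -eo pid,rss,comm` output, skipping the header line."""
    rows = []
    for line in output.strip().splitlines()[1:]:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows


def top_by_rss(output: str, n: int = 5) -> List[Row]:
    """Return the n processes with the largest resident set size."""
    return sorted(parse_ps(output), key=lambda row: row[1], reverse=True)[:n]


def current_top(n: int = 5) -> List[Row]:
    """Run ps on this host and return its top memory consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    return top_by_rss(result.stdout, n)
```

Calling `current_top()` before and after restarting a process gives a simple baseline for judging whether memory usage climbs again.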
HostOomKillDetected
Description: An Out-Of-Memory (OOM) kill event has been detected on the host.
Action:
- Verify if a BeaconNodeMemoryLeakDetected alert is also firing.
- Assess memory usage of other processes to confirm normal operation.
- Restart affected processes and monitor the system for stability.
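To identify which process was killed, the kernel log can be searched for OOM kill messages. The following sketch assumes the log contains the "Killed process <pid> (<name>)" wording used by Linux kernels; the exact message varies between kernel versions, and the sample line is invented for illustration.

```python
import re
from typing import List, Tuple

# Matches the kernel's "Killed process <pid> (<name>)" message. The exact
# wording varies between kernel versions, so treat this pattern as an assumption.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text: str) -> List[Tuple[int, str]]:
    """Return (pid, process name) for each OOM kill found in kernel log text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL_RE.finditer(log_text)]


if __name__ == "__main__":
    # Invented sample line; on a host, feed in `dmesg` or `journalctl -k` output.
    sample = (
        "[12345.678] Out of memory: Killed process 4242 (beacon-node) "
        "total-vm:9000000kB, anon-rss:7000000kB\n"
    )
    for pid, name in find_oom_kills(sample):
        print(f"OOM-killed: pid={pid} name={name}")
```

Knowing the killed process narrows down which service to restart and whether the memory leak alert points at the same component.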
HostRequiresReboot
Description: The host system requires a reboot.
Action:
- Schedule the reboot for a safe time, ensuring it does not conflict with critical duties such as sync committee participation and block proposals.
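Before scheduling, it can help to confirm on the host that a reboot is still pending and to see what requested it. The sketch below assumes a Debian or Ubuntu host, where a pending reboot is signalled by the file `/var/run/reboot-required` and the requesting packages are listed in `/var/run/reboot-required.pkgs`. Other distributions use different mechanisms, and the helper name is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple

# Debian/Ubuntu convention; other distributions signal this differently.
REBOOT_FLAG = Path("/var/run/reboot-required")


def reboot_status(flag: Path = REBOOT_FLAG) -> Tuple[bool, List[str]]:
    """Return (reboot needed?, packages that requested it)."""
    if not flag.exists():
        return False, []
    pkgs_file = flag.with_name(flag.name + ".pkgs")
    packages = pkgs_file.read_text().split() if pkgs_file.exists() else []
    return True, packages


if __name__ == "__main__":
    needed, packages = reboot_status()
    if needed:
        print("Reboot required by:", ", ".join(packages) or "unknown packages")
    else:
        print("No reboot pending")
```

The package list gives a hint about urgency: a kernel or core library update may justify an earlier maintenance window than a minor package.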
This runbook provides a structured approach to the alerts listed above. Review and update it periodically to incorporate new insights, tools, and procedures.