
General Run-book

This document outlines procedures and steps to address common alerts and issues.

HostDiskWillFillIn24Hours

Description: The disk is predicted to run out of space within the next 24 hours.

Action:

  1. Confirm all disks are still attached to the server.
  2. Perform a filesystem cleanup to free up space, usually by clearing the EL (execution layer) chaindata and resyncing.
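
Before cleaning up, it can help to confirm how quickly the disk is actually filling. The following is a minimal sketch, not part of the original procedure: it linearly extrapolates from two free-space readings, which is an assumption about how the alert's prediction roughly works. The function names `sample_free_bytes` and `hours_until_full` are hypothetical.

```python
import shutil
from typing import Optional


def sample_free_bytes(path: str = "/") -> int:
    """Return the free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def hours_until_full(free_then: int, free_now: int,
                     elapsed_seconds: float) -> Optional[float]:
    """Linearly extrapolate the hours left until free space reaches zero.

    Returns None when free space is not shrinking, since no fill time
    can be estimated in that case.
    """
    consumed = free_then - free_now
    if consumed <= 0 or elapsed_seconds <= 0:
        return None
    bytes_per_second = consumed / elapsed_seconds
    return free_now / bytes_per_second / 3600
```

One possible use is to call `sample_free_bytes` twice, a few minutes apart, on the mount point that holds the chaindata and pass both readings to `hours_until_full`. A result well below 24 suggests the cleanup should not be postponed.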

InstanceDown

Description: A server is not operational.

Action:

  1. Determine the deployment environment (AWS, Hetzner, Contabo, Netcup).
  2. Attempt to SSH into the server. If that fails, proceed to the next step.
  3. Attempt to restart the server.
  4. If the restart is unsuccessful, consider reaching out to the provider.
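
The SSH check in step 2 can be approximated with a quick TCP probe before attempting a full login. This is only a sketch under stated assumptions: it assumes SSH listens on the default port 22, and the helper name `ssh_port_open` is hypothetical.

```python
import socket


def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A `False` result only shows that the port is unreachable from where the probe runs. It does not distinguish a powered-off server from a firewall or network problem, so the restart and provider steps still apply.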

HostOutOfMemory

Description: The host is running out of memory.

Action:

  1. Check for a BeaconNodeMemoryLeakDetected alert.
  2. Review other processes to ensure memory usage is within expected parameters.
  3. Restart processes and monitor to see if the issue persists.
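
For step 2, one way to review per-process memory on a Linux host is to read the resident set size from `/proc`. The sketch below assumes a Linux `/proc` filesystem. The function names are hypothetical, and what counts as "expected" usage is left to the operator.

```python
import os
import re

_NAME = re.compile(r"^Name:\s+(.+)$", re.MULTILINE)
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_status(text):
    """Extract (process name, resident memory in kB) from /proc/<pid>/status text.

    Returns None when no VmRSS line is present, as with kernel threads.
    """
    name = _NAME.search(text)
    rss = _VMRSS.search(text)
    if not name or not rss:
        return None
    return name.group(1).strip(), int(rss.group(1))


def top_memory_processes(limit=5, proc_root="/proc"):
    """Return up to `limit` tuples of (rss_kb, pid, name), largest first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                parsed = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if parsed:
            results.append((parsed[1], int(entry), parsed[0]))
    results.sort(reverse=True)
    return results[:limit]
```

Calling `top_memory_processes()` before and after a restart gives a simple baseline for judging whether the issue persists.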

HostOomKillDetected

Description: An Out-Of-Memory (OOM) kill event has been detected on the host.

Action:

  1. Verify if a BeaconNodeMemoryLeakDetected alert is also firing.
  2. Assess memory usage of other processes to confirm normal operation.
  3. Restart affected processes and monitor the system for stability.
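
To find out which process was killed, the kernel log can be searched for OOM kill entries. The following sketch assumes the message format used by recent Linux kernels ("Out of memory: Killed process <pid> (<name>)"); older kernels word this differently, so the pattern may need adjusting. The function name `find_oom_kills` is hypothetical.

```python
import re

# Assumed kernel log format; adjust if your kernel words the message differently.
_OOM_LINE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process name) for each OOM kill line in `log_text`."""
    return [(int(pid), name) for pid, name in _OOM_LINE.findall(log_text)]
```

The input can be the captured output of `dmesg` or `journalctl -k`. If the killed process turns out to be the beacon node, that supports checking the BeaconNodeMemoryLeakDetected alert first.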

HostRequiresReboot

Description: The host system requires a reboot.

Action:

  1. Schedule the reboot for a safe time, ensuring it doesn't conflict with critical operations such as sync committee duties and block proposals.
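
The scheduling decision can be expressed as a small check. This is a sketch under several assumptions: the host is Debian or Ubuntu (which create `/var/run/reboot-required`), slots last 12 seconds as on Ethereum mainnet, and the upcoming proposal slots and sync committee membership have already been obtained from your own beacon node or validator tooling. The function names and the default downtime of 600 seconds are hypothetical.

```python
import os

SECONDS_PER_SLOT = 12  # Ethereum mainnet slot duration


def reboot_required(flag_path="/var/run/reboot-required"):
    """Return True if the Debian/Ubuntu reboot flag file exists."""
    return os.path.exists(flag_path)


def safe_to_reboot(current_slot, upcoming_proposal_slots, in_sync_committee,
                   expected_downtime_seconds=600):
    """Return True if no known duty falls inside the expected downtime window.

    The inputs are assumed to come from the operator's own duty lookups.
    """
    if in_sync_committee:
        return False  # sync committee duties occur in every slot
    # Ceiling division: number of slots the downtime may span.
    downtime_slots = -(-expected_downtime_seconds // SECONDS_PER_SLOT)
    window_end = current_slot + downtime_slots
    return not any(current_slot <= slot <= window_end
                   for slot in upcoming_proposal_slots)
```

For example, with a 600-second downtime the window covers 50 slots, so a proposal 30 slots away blocks the reboot while one 100 slots away does not.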

This run-book provides a structured approach to the alerts and issues listed above. Review and update it periodically to incorporate new insights, tools, and procedures as the infrastructure evolves.
