Skip to main content

Overview

After onboarding, operators use Clusters for continuous health management, deployment oversight, and safe lifecycle changes.

Daily Operations Loop

1) Health Monitoring

  • Review General tab for CPU/GPU/memory/storage trends.
  • Check for abrupt drops in available workers or nodes.
  • Use time windows to detect regressions.

2) Deployment Oversight

  • Open Deployments tab to review active workloads.
  • Confirm deployment status, worker counts, and routing behavior.
  • Escalate unhealthy deployments before cluster-wide degradation.
Image

3) Node Diagnostics

  • Use Nodes tab for request-vs-allocatable analysis.
  • Open node event panels for warnings (scheduling failures, connectivity, taints).
  • Prioritize repeated or high-severity events for immediate remediation.

4) Safe Editing and Deletion

  • Use edit actions to update metadata like name or ingress.
  • Before deletion, confirm no active endpoints depend on the cluster.
  • Record change rationale for governance and post-incident reviews.

Incident Response Pattern

Best Practices

Separate production and non-production clusters with clear naming.
Review node events before and after major deployment rollouts.
Avoid destructive operations during unresolved incidents.
Align cluster actions with RBAC and audit policy requirements.