Operational Monitoring Guide - Bud Stack Documentation

Overview

This guide describes how to use Dashboard as the first stop for operations triage and proactive monitoring.

Monitoring Playbook

1. Daily health triage

Check request trend direction.
Check readiness cards (running endpoints, inactive clusters).
Scan latency and throughput deltas.

2. Project hotspot identification

Use API Calls chart to locate high-volume projects.
Compare their latency and throughput behavior.
Prioritize investigation where volume and degradation overlap.

3. Model quality and efficiency checks

Compare top model usage with accuracy leaderboard.
Validate whether high-usage models remain quality-competitive.
Track token changes to catch rising inference cost patterns.

Incident-Oriented Workflow

When a degradation is reported:

Confirm signal in Dashboard.
Identify whether issue is global, project-specific, or model-specific.
Drill into the corresponding module (Projects, Models, Clusters, Observability).
Implement corrective action.
Verify stabilization using the same dashboard window.

Escalation Heuristics

Escalate to platform engineering when latency and throughput degrade across multiple projects.
Escalate to model governance when accuracy drops on top-used models.
Escalate to FinOps when token output growth persists over multiple windows.

Recommended Team Rituals

Start standups with a 3-minute dashboard snapshot.

Use one owner per red signal to avoid ambiguous accountability.

Archive weekly screenshot summaries for trend continuity.

Dashboard Metrics ReferenceReference for dashboard cards, chart metrics, and interpretation hints

Overview
Monitoring Playbook
1. Daily health triage
2. Project hotspot identification
3. Model quality and efficiency checks
Incident-Oriented Workflow
Escalation Heuristics
Recommended Team Rituals