Overview

A self-hosted Bud deployment has two resource layers, and sizing each one separately is the key to right-sizing your infrastructure:

Platform services

The control plane — dashboard, APIs, gateway, databases, queue, object storage, and monitoring. Memory-led and largely fixed. CPU stays low at rest and rises only with request throughput.

Inference nodes

Where your models actually run, and the dominant cost at scale. Each model is provisioned on dedicated inference nodes whose hardware (GPU, CPU, or HPU) is chosen from the model itself, its use case, and its latency/throughput targets (SLOs). The platform recommends the node type and configuration when you deploy each model.
Plan the platform services first (a relatively fixed footprint), then add inference nodes of the appropriate type — CPU, GPU, or HPU — for the models you intend to serve. The two layers scale independently.

Choosing inference hardware

You add inference capacity by attaching nodes with the accelerator that best fits each model and its service-level objectives. The platform’s optimizer analyses the model and target SLO and recommends the node type and count.

| Hardware | Typical fit |
| --- | --- |
| GPU (CUDA) | Largest models, lowest latency, highest concurrency |
| CPU | Smaller / quantized models, batch or latency-tolerant workloads, cost-sensitive sites |
| HPU (Intel Gaudi) | High-throughput serving where Gaudi accelerators are available |

A single deployment can mix node types: for example, GPU nodes for latency-critical models and CPU nodes for smaller or background workloads.

Choosing a deployment size

One machine running the full platform plus one or more models locally. Suitable for evaluations, edge sites, and small teams (up to ~100 concurrent users).

| Resource | Minimum | Recommended |
| --- | --- | --- |
| CPU | 16 cores | 32 cores |
| Memory | 96 GiB | 128 GiB |
| Storage (NVMe SSD) | 1 TiB | 2 TiB |
| Inference accelerator | CPU-only (smaller / quantized models) | 1 × 48–80 GB GPU for larger models or low-latency serving |

The platform services alone use roughly 50–60 GiB of memory, so 64 GiB is not enough once a model and the operating system are added. Likewise, the fixed storage footprint already exceeds 200 GiB before model files, so start at 1 TiB.
Smaller or quantized models can serve on CPU. A single 48–80 GB GPU comfortably serves one model in the ~20–30B-parameter range at low latency. Larger models or higher concurrency call for the clustered tier.
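As a rough sanity check on the GPU guidance above, the sketch below estimates the size of the weights alone (parameters × bytes per parameter) and compares it with the GPU's memory. The 20% headroom reserved for KV cache and activations is an assumption chosen for illustration, not a Bud figure, and the function names are hypothetical.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone, in decimal GB."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def fits_on_gpu(params_billion: float, bytes_per_param: float,
                gpu_gb: float, headroom: float = 0.2) -> bool:
    """True if the weights fit while keeping `headroom` of GPU memory free.

    `headroom` is an assumed fraction set aside for KV cache and
    activations; real requirements depend on context length and concurrency.
    """
    return weights_gb(params_billion, bytes_per_param) <= gpu_gb * (1 - headroom)


print(fits_on_gpu(30, 2, 80))  # 30B in bf16 on an 80 GB GPU -> True
print(fits_on_gpu(30, 2, 48))  # 30B in bf16 on a 48 GB GPU  -> False
print(fits_on_gpu(30, 1, 48))  # 30B in fp8 on a 48 GB GPU   -> True
```

Under these assumptions a 30B model needs the larger GPU at bf16 but fits the smaller one once quantized to 8 bits. The platform's own recommendation at deployment time takes precedence over this estimate.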

Storage planning

Storage falls into three independent components. They scale on completely different axes, so size each one separately rather than picking a single total:

| Component | Scales with | Bounded? |
| --- | --- | --- |
| Platform baseline | Fixed footprint: databases, queue, object-store metadata | Yes, roughly fixed |
| Model storage | Number and size of the models you onboard | No, grows with your catalog |
| Analytics & observability | Request rate × retention window | Yes, plateaus at the retention window |

Platform baseline

Independent of traffic and of your model catalog, a deployment provisions storage for its databases, message queue, and object-store metadata. Plan for roughly 200–500 GiB for this layer before any model files or request data.

Model storage

The model registry, where downloaded model weights live, is usually the largest and least predictable part of total storage, and it scales with your model catalog, not with the deployment tier. Size it explicitly. Per model, registry size ≈ parameters × bytes-per-parameter × variants kept:

| Precision | Bytes/param | 8B | 70B | 405B |
| --- | --- | --- | --- | --- |
| bf16 / fp16 | 2 | ~16 GB | ~140 GB | ~810 GB |
| fp8 / int8 | 1 | ~8 GB | ~70 GB | ~405 GB |
| int4 (quantized) | 0.5 | ~4 GB | ~35 GB | ~200 GB |
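The per-model formula can be sketched as a small calculator. Because variants of one model are usually kept at different precisions, this version sums each retained variant rather than multiplying by a single count. The function names and the catalog layout are hypothetical; the bytes-per-parameter values come from the table above.

```python
# Bytes per parameter for each precision, as listed in the table above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def variant_size_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB for one variant of a model."""
    return params_billion * BYTES_PER_PARAM[precision]


def catalog_size_gb(catalog) -> float:
    """Sum the size of every retained variant.

    `catalog` is a list of (params_billion, [precisions kept]) pairs.
    """
    return sum(
        variant_size_gb(params, precision)
        for params, precisions in catalog
        for precision in precisions
    )


# A 70B model kept as bf16 and int4, plus an 8B model in bf16.
catalog = [(70, ["bf16", "int4"]), (8, ["bf16"])]
print(catalog_size_gb(catalog))  # 140 + 35 + 16 -> 191.0
```

The result is the raw weight size only; tokenizer files, configuration, and any headroom for future variants come on top.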
Bud often keeps more than one variant of a model, for example the original weights plus a quantized copy, so multiply by the number of variants you retain.

Example registry sizes. Most catalogs fall into a few tiers. Use these as a starting point, then refine with the formula above for your exact model list:

| Registry tier | Example catalog | Approx. catalog size | Suggested registry capacity |
| --- | --- | --- | --- |
| Evaluation | One 8B chat model (bf16) + a small embedding model | ~20 GB | 50 GiB |
| Small catalog | A few 7–32B models, one of them quantized, + embeddings and a reranker | ~50–100 GB | 200 GiB |
| Production | A 70B model kept as both bf16 and int4, + 8–32B models and embeddings | ~250–500 GB | 500 GiB – 1 TiB |
| Large / multi-tenant | A quantized 405B model + several 70B/32B variants + a multimodal model + embeddings | ~1–3 TB | 2–5 TiB+ |

The suggested registry capacity column is what to provision for the registry volume and set as MODEL_REGISTRY_MAX_SIZE; it adds headroom above the raw catalog size for variants you add later and for the local download cache.

Model weights occupy storage in up to three places; budget for all of them:
  • Registry (durable copy) — one copy per model variant in the object store (S3 / MinIO / rustfs / Ceph). This is what the registry budget, MODEL_REGISTRY_MAX_SIZE, governs.
  • Local download cache — staging on the model-registry volume while a model is fetched, before it is uploaded to the registry. Provision at least your largest model × the number of concurrent downloads.
  • Inference model volume — when a model is deployed, its weights are placed on a volume the inference pool mounts. Use a shared (ReadWriteMany) storage class — NFS, AWS EFS, Azure Files, or similar — so the pool keeps one copy of each model regardless of node count; size it to the models that pool serves. (A node-local ReadWriteOnce volume works for single-node serving, but then each node needs its own copy.) Note the trade-off: shared network-attached storage (NFS, AWS EFS) saves space but can slow model load times at pod startup/scale-up compared with node-local NVMe — back it with fast storage, or use a node-local cache for latency-sensitive cold starts.
The registry runs a pre-flight capacity check before every download, so an over-full registry fails fast instead of part-way through a multi-gigabyte upload. Set MODEL_REGISTRY_MAX_SIZE to your provisioned registry capacity and grow the backing volume as your catalog grows — see Helm Configuration.
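The download-cache rule and the idea behind a pre-flight capacity check reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, sizes are plain GB numbers rather than whatever format MODEL_REGISTRY_MAX_SIZE actually accepts, and it is not Bud's implementation of the check.

```python
def download_cache_gb(largest_model_gb: float, concurrent_downloads: int) -> float:
    """Minimum staging space: largest model x number of concurrent downloads."""
    return largest_model_gb * concurrent_downloads


def preflight_ok(used_gb: float, incoming_gb: float, max_size_gb: float) -> bool:
    """True if the incoming model fits under the configured registry limit."""
    return used_gb + incoming_gb <= max_size_gb


# Two concurrent downloads of a ~140 GB model need ~280 GB of staging space.
print(download_cache_gb(140, 2))     # -> 280
# A 500 GB registry already holding 400 GB cannot accept a 140 GB model.
print(preflight_ok(400, 140, 500))   # -> False
print(preflight_ok(300, 140, 500))   # -> True
```

Running the same arithmetic before onboarding a large model shows whether the backing volume needs to grow first.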

Analytics & observability growth

Usage data grows with traffic but is bounded by retention windows, so it reaches a steady state rather than growing forever:

| Data | What it is | Default retention |
| --- | --- | --- |
| Inference analytics | Per-request metadata: tokens, latency, model, status | 90 days |
| Observability | Traces, logs, and metrics across the platform | 30 days (configurable) |
| Usage metrics | Aggregated dashboards and billing rollups | 90 days |

Estimate the growth component with:

steady-state size ≈ bytes-per-request × requests-per-second × 86,400 × retention-days × replication

The default deployment stores per-request metadata only (not full prompt and response text), which keeps this small. As a guide, at a sustained 1,000 requests/second:

| Configuration | Steady-state (3 HA copies) |
| --- | --- |
| Metadata only (default) | ~6 TB, plateaus at the retention window |
| With full request/response logging enabled | ~19 TB, plateaus at the retention window |

Scale linearly for other rates: for example, 100 requests/second is roughly one tenth.
Enabling full request/response content logging substantially increases storage. If you turn it on, set an appropriate retention window first and provision accordingly.
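The steady-state formula translates directly into code. In the sketch below the 250 bytes-per-request value is an assumption: the documentation does not state a per-request size, and 250 was picked only because, with 90-day retention and three copies, it lands close to the ~6 TB metadata-only figure. Measure your own average before relying on it.

```python
SECONDS_PER_DAY = 86_400


def steady_state_bytes(bytes_per_request: int, requests_per_second: int,
                       retention_days: int, replication: int = 3) -> int:
    """Storage held once the retention window is full, across all copies."""
    return (bytes_per_request * requests_per_second * SECONDS_PER_DAY
            * retention_days * replication)


# Assumed 250 bytes of metadata per request, 1,000 req/s, 90 days, 3 copies.
size = steady_state_bytes(250, 1_000, 90)
print(size / 1e12)  # -> 5.832 (TB)

# The estimate is linear in request rate: 100 req/s is one tenth of the above.
print(steady_state_bytes(250, 100, 90) / 1e12)  # -> 0.5832 (TB)
```

Shortening the retention window scales the result the same way, which is why retention is the main lever for keeping this component bounded.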

Keeping storage predictable

  • Prune unused models and stale quantized variants from the registry — for most deployments the model catalog, not request volume, is the largest storage driver.
  • Tune the observability retention window to your needs (shorter = less storage).
  • Keep full request/response logging off unless you need it, and bound it with a retention window when you do.
  • Use premium SSD/NVMe for databases and the model registry; standard SSD is fine for general application data.
  • For very large datasets, scale the analytics database horizontally rather than relying on replication alone.

Scaling

Platform services

  • The platform layer is memory-led; provision memory generously and treat CPU as elastic.
  • Stateless services support horizontal autoscaling (enable it per service for high availability and burst handling).
  • Run three copies of stateful services for high availability in the clustered and large-scale tiers.

Inference nodes

  • Add inference nodes to increase serving capacity. Choose the accelerator — GPU, CPU, or HPU — based on each model and its SLO; the platform recommends the node type and configuration at deployment time.
  • Scale out by adding model replicas across inference nodes as concurrency grows.
  • Keep inference nodes in a separate node pool from the platform services so the two scale independently, and mix node types to match each model.

Networking

| Traffic | Minimum | Recommended |
| --- | --- | --- |
| Between nodes | 5 Gbps | 10–40 Gbps (higher for inference pools) |
| Internet ingress/egress | 1 Gbps | 5 Gbps |
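To see why inference pools benefit from the higher inter-node bandwidth, it helps to estimate how long model weights take to cross the network. The sketch below assumes an ideal link running at full line rate by default; the `efficiency` parameter is a hypothetical knob for protocol and storage overhead, and real load times will be longer.

```python
def transfer_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Time to move `size_gb` (decimal GB) over a `link_gbps` link.

    `efficiency` is the assumed fraction of line rate actually achieved.
    """
    bits = size_gb * 8e9
    return bits / (link_gbps * 1e9 * efficiency)


# A ~140 GB model (70B in bf16) at the minimum and recommended link speeds.
print(transfer_seconds(140, 5))   # -> 224.0 seconds
print(transfer_seconds(140, 10))  # -> 112.0 seconds
print(transfer_seconds(140, 40))  # -> 28.0 seconds
```

The gap between minutes and seconds matters most when pods load weights from shared network storage during scale-up.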

Next steps

  • Installation Guide: Deploy the platform on Kubernetes
  • Helm Configuration: Configure resources, retention, and services
  • Deployment: Deployment options and workflows