Hardware Requirements - Bud Stack Documentation

Overview

A self-hosted Bud deployment has two resource layers, and sizing each one separately is the key to right-sizing your infrastructure:

Platform services

The control plane: dashboard, APIs, gateway, databases, queue, object storage, and monitoring. This layer is memory-led and largely fixed in size; CPU use stays low at rest and rises only with request throughput.

Inference nodes

Where your models actually run, and the dominant cost at scale. Models are provisioned on dedicated inference nodes whose hardware (GPU, CPU, or HPU) is chosen per model according to its use case and latency/throughput (SLO) targets. The platform recommends the node type and configuration when you deploy each model.

Plan the platform services first (a relatively fixed footprint), then add inference nodes of the appropriate type — CPU, GPU, or HPU — for the models you intend to serve. The two layers scale independently.

Choosing inference hardware

You add inference capacity by attaching nodes with the accelerator that best fits each model and its service-level objectives. The platform’s optimizer analyses the model and target SLO and recommends the node type and count.

Hardware	Typical fit
GPU (CUDA)	Largest models, lowest latency, highest concurrency
CPU	Smaller / quantized models, batch or latency-tolerant workloads, cost-sensitive sites
HPU (Intel Gaudi)	High-throughput serving where Gaudi accelerators are available

A single deployment can mix node types — for example, GPU nodes for latency-critical models and CPU nodes for smaller or background workloads.

Choosing a deployment size

The three deployment sizes are Single Node (AI-in-a-Box), Clustered (High Availability), and Large-Scale (Multi-Tenant).

Single Node (AI-in-a-Box)

One machine running the full platform plus one or more models locally. Suitable for evaluations, edge sites, and small teams (up to ~100 concurrent users).

Resource	Minimum	Recommended
CPU	16 cores	32 cores
Memory	96 GiB	128 GiB
Storage (NVMe SSD)	1 TiB	2 TiB
Inference accelerator	CPU-only (smaller / quantized models)	1 × 48–80 GB GPU for larger models or low-latency serving

The platform services alone use roughly 50–60 GiB of memory, so 64 GiB is not enough once a model and the operating system are added. Likewise, the fixed storage footprint already exceeds 200 GiB before model files — start at 1 TiB.

Smaller or quantized models can serve on CPU. A single 48–80 GB GPU comfortably serves one model in the ~20–30B-parameter range at low latency. Larger models or higher concurrency call for the clustered tier.
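
The memory argument above can be checked with a few lines of arithmetic. This is a rough sketch, not a sizing tool: the 60 GiB platform figure is the upper end of the range quoted above, while the 4 GiB operating-system reserve and the function names are assumptions made for illustration. It covers a CPU-served model whose weights sit in host memory and ignores runtime overhead such as the KV cache.

```python
# Rough host-memory budget for a single-node deployment (values in GiB).
PLATFORM_GIB = 60    # upper end of the 50-60 GiB platform figure
OS_RESERVE_GIB = 4   # assumption: headroom kept for the operating system


def host_memory_needed(model_gib: float) -> float:
    """Platform services + OS reserve + weights of one CPU-served model."""
    return PLATFORM_GIB + OS_RESERVE_GIB + model_gib


def fits(host_gib: float, model_gib: float) -> bool:
    """True if the host has at least the memory the budget calls for."""
    return host_memory_needed(model_gib) <= host_gib


# An 8B model at int8 is about 8 GB of weights (treated as 8 GiB here).
for host in (64, 96, 128):
    verdict = "fits" if fits(host, 8) else "too small"
    print(f"{host} GiB host: {verdict}")
```

With these assumptions a 64 GiB host falls short, while the 96 GiB minimum leaves room for the model.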

Clustered (High Availability)

A highly available platform across three or more nodes, with a separate inference worker pool. Suitable for production workloads up to ~1,000 concurrent users and several models served at once.

Platform nodes

Resource	Recommended
Node shape	3 × (16–32 vCPU / 64–128 GiB / 500 GiB–1 TiB NVMe)
Total CPU	48–96 vCPU
Total memory	192–384 GiB
Platform storage (node-local)	1.5–3 TiB aggregate (the per-node NVMe above) for databases, queue, and working set
Analytics & observability	Separate volumes sized to your retention window (grows with traffic) — see Storage planning
Model storage (registry)	Separate volume sized to your model catalog — see Storage planning

Inference worker nodes (separate pool)

Resource	Recommended
Accelerators	GPU, CPU, and/or HPU — chosen per model and SLO
Capacity	Sized to the models you serve; the platform recommends node type and count per model
Memory per GPU node	≥ 2× the node’s total GPU memory
Model volume	Shared ReadWriteMany storage class (NFS, AWS EFS, Azure Files, etc.) sized to the models the pool serves — one copy for the whole pool, not per node (see Storage planning)
Networking	10 Gbps between nodes

High availability comes from running three copies of the platform’s stateful services, which is the main increase over the single-node tier. Inference nodes can be a mix of types — for example, GPU nodes for latency-critical models alongside CPU nodes for smaller workloads.
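
The totals in the platform table follow directly from the per-node shape, and the host-memory rule for GPU nodes is a simple multiple. The sketch below reproduces both; the function names and the example of four 80 GB GPUs per node are illustrative assumptions, not values from this guide.

```python
def platform_totals(nodes: int, vcpu: int, mem_gib: int, nvme_gib: int) -> dict:
    """Aggregate capacity of identical platform nodes."""
    return {
        "vcpu": nodes * vcpu,
        "memory_gib": nodes * mem_gib,
        "nvme_gib": nodes * nvme_gib,
    }


def min_host_memory_gb(gpus_per_node: int, gpu_mem_gb: int) -> int:
    """Host memory per GPU node: at least 2x the node's total GPU memory."""
    return 2 * gpus_per_node * gpu_mem_gb


# Lower and upper ends of the recommended node shape (500 GiB and 1 TiB NVMe).
print(platform_totals(3, 16, 64, 500))
print(platform_totals(3, 32, 128, 1024))

# Hypothetical GPU node with 4 x 80 GB accelerators.
print(min_host_memory_gb(4, 80), "GB host memory or more")
```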

Large-Scale (Multi-Tenant)

A multi-tenant deployment for 10,000+ users. The platform layer grows modestly; inference capacity and data retention dominate.

Platform

Resource	Recommended
Nodes	~6
Total CPU	96–128 vCPU
Total memory	384–512 GiB

Inference capacity

Resource	Recommended
Inference nodes	Many, multi-node — a mix of GPU/CPU/HPU sized to your model catalog, traffic, and SLOs

Storage

Resource	Recommended
Platform + analytics	10–20 TiB+ — grows with request volume and retention
Model storage (registry)	Sized to your model catalog, separate from the above — see Storage planning

At this scale both the model registry and request-analytics/usage data are major storage drivers. Size each independently — see Storage planning below.

Storage planning

Storage falls into three independent components. They scale on completely different axes, so size each one separately rather than picking a single total:

Component	Scales with	Bounded?
Platform baseline	Fixed footprint — databases, queue, object-store metadata	Yes, roughly fixed
Model storage	Number and size of the models you onboard	No — grows with your catalog
Analytics & observability	Request rate × retention window	Yes — plateaus at the retention window

Platform baseline

Independent of traffic and of your model catalog, a deployment provisions storage for its databases, message queue, and object-store metadata. Plan for roughly 200–500 GiB for this layer before any model files or request data.

Model storage

The model registry — where downloaded model weights live — is usually the largest and least predictable part of total storage, and it scales with your model catalog, not with the deployment tier. Size it explicitly. Per model, registry size ≈ parameters × bytes-per-parameter × variants kept:

Precision	Bytes/param	8B	70B	405B
bf16 / fp16	2	~16 GB	~140 GB	~810 GB
fp8 / int8	1	~8 GB	~70 GB	~405 GB
int4 (quantized)	0.5	~4 GB	~35 GB	~200 GB
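
The table values can be reproduced, and extended to your own model list, with a short calculation. This is a minimal sketch: the function names are made up for illustration, each retained variant is summed at its own precision, and small extras such as tokenizer and config files are ignored.

```python
# Bytes per parameter for the precisions in the table above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def variant_size_gb(params_billion: float, precision: str) -> float:
    """Weights size in GB: (params_billion * 1e9) * bytes / 1e9."""
    return params_billion * BYTES_PER_PARAM[precision]


def catalog_size_gb(catalog: list[tuple[float, str]]) -> float:
    """Sum over every (parameters in billions, precision) variant retained."""
    return sum(variant_size_gb(p, prec) for p, prec in catalog)


# Example catalog: a 70B model kept as bf16 and int4, plus an 8B bf16 model.
catalog = [(70, "bf16"), (70, "int4"), (8, "bf16")]
print(f"raw catalog size: ~{catalog_size_gb(catalog):.0f} GB")
```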

Bud often keeps more than one variant of a model, for example the original weights plus a quantized copy, so multiply by the number of variants you retain.

Example registry sizes

Most catalogs fall into a few tiers. Use these as a starting point, then refine with the formula above for your exact model list:

Registry tier	Example catalog	Approx. catalog size	Suggested registry capacity
Evaluation	One 8B chat model (bf16) + a small embedding model	~20 GB	50 GiB
Small catalog	A few 7–32B models, one of them quantized, + embeddings and a reranker	~50–100 GB	200 GiB
Production	A 70B model kept as both bf16 and int4, + 8–32B models and embeddings	~250–500 GB	500 GiB – 1 TiB
Large / multi-tenant	A quantized 405B model + several 70B/32B variants + a multimodal model + embeddings	~1–3 TB	2–5 TiB+

The suggested registry capacity column is what to provision for the registry volume and set as MODEL_REGISTRY_MAX_SIZE; it adds headroom above the raw catalog size for variants you add later and for the local download cache.

Model weights occupy storage in up to three places; budget for all of them:

Registry (durable copy) — one copy per model variant in the object store (S3 / MinIO / rustfs / Ceph). This is what the registry budget, MODEL_REGISTRY_MAX_SIZE, governs.
Local download cache — staging on the model-registry volume while a model is fetched, before it is uploaded to the registry. Provision at least your largest model × the number of concurrent downloads.
Inference model volume — when a model is deployed, its weights are placed on a volume the inference pool mounts. Use a shared (ReadWriteMany) storage class — NFS, AWS EFS, Azure Files, or similar — so the pool keeps one copy of each model regardless of node count; size it to the models that pool serves. (A node-local ReadWriteOnce volume works for single-node serving, but then each node needs its own copy.) Note the trade-off: shared network-attached storage (NFS, AWS EFS) saves space but can slow model load times at pod startup/scale-up compared with node-local NVMe — back it with fast storage, or use a node-local cache for latency-sensitive cold starts.
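
The download-cache and inference-volume rules above reduce to two small formulas. The sketch below is illustrative only: the function names, model sizes, and node count are assumptions, and it does not add headroom.

```python
def download_cache_gb(largest_model_gb: float, concurrent_downloads: int) -> float:
    """Staging space: largest model x number of concurrent downloads."""
    return largest_model_gb * concurrent_downloads


def inference_volume_gb(deployed_models_gb: list[float], nodes: int, shared: bool) -> float:
    """Shared (ReadWriteMany) storage holds one copy; node-local storage holds one per node."""
    one_copy = sum(deployed_models_gb)
    return one_copy if shared else one_copy * nodes


deployed = [140, 16]  # hypothetical pool: a 70B bf16 model and an 8B bf16 model
print("download cache:", download_cache_gb(140, 2), "GB")
print("shared volume:", inference_volume_gb(deployed, nodes=4, shared=True), "GB")
print("node-local total:", inference_volume_gb(deployed, nodes=4, shared=False), "GB")
```

In this example the shared volume needs 156 GB regardless of node count, whereas node-local copies across four nodes add up to 624 GB.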

The registry runs a pre-flight capacity check before every download, so an over-full registry fails fast instead of part-way through a multi-gigabyte upload. Set MODEL_REGISTRY_MAX_SIZE to your provisioned registry capacity and grow the backing volume as your catalog grows — see Helm Configuration.
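
To illustrate why checking capacity up front matters, here is a generic sketch of a fail-fast check. It is not Bud's implementation: the function, the exception class, and the use of plain byte counts for the size limit are all assumptions made for this example.

```python
class RegistryFullError(Exception):
    """Raised when a download would exceed the configured registry budget."""


def preflight_check(used_bytes: int, incoming_bytes: int, max_size_bytes: int) -> None:
    """Reject the download before any data moves if it cannot fit."""
    projected = used_bytes + incoming_bytes
    if projected > max_size_bytes:
        raise RegistryFullError(
            f"need {projected} bytes, but the registry budget is {max_size_bytes}"
        )


GB = 10**9
budget = 500 * GB  # stands in for the value set as MODEL_REGISTRY_MAX_SIZE

preflight_check(300 * GB, 140 * GB, budget)  # fits, returns silently

try:
    preflight_check(400 * GB, 140 * GB, budget)  # would overflow the budget
except RegistryFullError as err:
    print("rejected before download:", err)
```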

Analytics & observability growth

Usage data grows with traffic but is bounded by retention windows, so it reaches a steady state rather than growing forever:

Data	What it is	Default retention
Inference analytics	Per-request metadata: tokens, latency, model, status	90 days
Observability	Traces, logs, and metrics across the platform	30 days (configurable)
Usage metrics	Aggregated dashboards and billing rollups	90 days

Estimate the growth component with:

steady-state size ≈ bytes-per-request × requests-per-second × 86,400 × retention-days × replication

The default deployment stores per-request metadata only (not full prompt and response text), which keeps this small. As a guide, at a sustained 1,000 requests/second:

Configuration	Steady-state (3 HA copies)
Metadata only (default)	~6 TB, plateaus at the retention window
With full request/response logging enabled	~19 TB, plateaus at the retention window

Scale linearly for other rates — e.g. 100 requests/second is roughly one tenth.
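
The steady-state formula is easy to evaluate for your own traffic. In the sketch below, the 250 bytes per request is an assumed figure chosen only for illustration (measure your own), and the 90-day retention and three copies are taken from the defaults and table above.

```python
SECONDS_PER_DAY = 86_400


def steady_state_bytes(bytes_per_request: int, rps: int,
                       retention_days: int, replication: int) -> int:
    """bytes-per-request x requests-per-second x 86,400 x retention-days x replication."""
    return bytes_per_request * rps * SECONDS_PER_DAY * retention_days * replication


# Assumed 250 bytes of metadata per request, 90-day retention, 3 HA copies.
for rps in (100, 1_000):
    size_tb = steady_state_bytes(250, rps, 90, 3) / 1e12
    print(f"{rps} req/s -> ~{size_tb:.1f} TB at steady state")
```

Under these assumptions 1,000 requests/second comes to roughly 5.8 TB, in line with the metadata-only row above, and 100 requests/second to one tenth of that.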

Enabling full request/response content logging substantially increases storage. If you turn it on, set an appropriate retention window first and provision accordingly.

Keeping storage predictable

Prune unused models and stale quantized variants from the registry — for most deployments the model catalog, not request volume, is the largest storage driver.
Tune the observability retention window to your needs (shorter = less storage).
Keep full request/response logging off unless you need it, and bound it with a retention window when you do.
Use premium SSD/NVMe for databases and the model registry; standard SSD is fine for general application data.
For very large datasets, scale the analytics database horizontally rather than relying on replication alone.

Scaling

Platform services

The platform layer is memory-led; provision memory generously and treat CPU as elastic.
Stateless services support horizontal autoscaling (enable it per service for high availability and burst handling).
Run three copies of stateful services for high availability in the clustered and large-scale tiers.

Inference nodes

Add inference nodes to increase serving capacity. Choose the accelerator — GPU, CPU, or HPU — based on each model and its SLO; the platform recommends the node type and configuration at deployment time.
Scale out by adding model replicas across inference nodes as concurrency grows.
Keep inference nodes in a separate node pool from the platform services so the two scale independently, and mix node types to match each model.
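
For a first estimate of how many replicas a given concurrency needs, a ceiling division is often enough. This is a back-of-the-envelope sketch, not the platform's optimizer: the per-replica concurrency is a hypothetical number you would measure for your own model and SLO.

```python
import math


def replicas_needed(peak_concurrency: int, per_replica_concurrency: int) -> int:
    """Smallest replica count whose combined capacity covers the peak (at least 1)."""
    return max(1, math.ceil(peak_concurrency / per_replica_concurrency))


# Hypothetical: each replica sustains 64 concurrent requests within the SLO.
for users in (100, 1_000):
    print(users, "concurrent requests ->", replicas_needed(users, 64), "replicas")
```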

Networking

Traffic	Minimum	Recommended
Between nodes	5 Gbps	10–40 Gbps (higher for inference pools)
Internet ingress/egress	1 Gbps	5 Gbps
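
Link speed matters most when model weights are pulled over the network at pod startup. The sketch below gives a best-case transfer time at full line rate; it ignores protocol overhead, storage throughput, and contention, so real load times will be longer. The function name and the 140 GB example (a 70B model in bf16) are illustrative.

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal time to move size_gb gigabytes over a link of link_gbps gigabits/second."""
    return size_gb * 8 / link_gbps


for gbps in (5, 10, 40):
    print(f"140 GB over {gbps} Gbps: at least {transfer_seconds(140, gbps):.0f} s")
```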

Next steps

Installation Guide

Deploy the platform on Kubernetes

Helm Configuration

Configure resources, retention, and services

Deployment

Deployment options and workflows