A self-hosted Bud deployment has two resource layers, and sizing each one
separately is the key to right-sizing your infrastructure:
Platform services
The control plane — dashboard, APIs, gateway, databases, queue, object
storage, and monitoring. Memory-led and largely fixed. CPU stays low at
rest and rises only with request throughput.
Inference nodes
Where your models actually run. The dominant cost at scale. Provisioned
on dedicated inference nodes whose hardware — GPU, CPU, or HPU — is chosen
per model based on the model, use case, and latency/throughput (SLO) targets.
The platform recommends the node type and configuration when you deploy each
model.
Plan the platform services first (a relatively fixed footprint), then add
inference nodes of the appropriate type — CPU, GPU, or HPU — for the models you
intend to serve. The two layers scale independently.
You add inference capacity by attaching nodes with the accelerator that best
fits each model and its service-level objectives. The platform’s optimizer
analyses the model and target SLO and recommends the node type and count.
| Hardware | Typical fit |
| --- | --- |
| GPU (CUDA) | Largest models, lowest latency, highest concurrency |
| CPU | Smaller / quantized models, batch or latency-tolerant workloads, cost-sensitive sites |
| HPU (Intel Gaudi) | High-throughput serving where Gaudi accelerators are available |
A single deployment can mix node types — for example, GPU nodes for
latency-critical models and CPU nodes for smaller or background workloads.
One machine running the full platform plus one or more models locally.
Suitable for evaluations, edge sites, and small teams (up to ~100 concurrent
users).
| Resource | Minimum | Recommended |
| --- | --- | --- |
| CPU | 16 cores | 32 cores |
| Memory | 96 GiB | 128 GiB |
| Storage (NVMe SSD) | 1 TiB | 2 TiB |
| Inference accelerator | CPU-only (smaller / quantized models) | 1 × 48–80 GB GPU for larger models or low-latency serving |
The platform services alone use roughly 50–60 GiB of memory, so 64 GiB is
not enough once a model and the operating system are added. Likewise, the
fixed storage footprint already exceeds 200 GiB before model files — start at
1 TiB.
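As a rough illustration of the memory arithmetic above, the sketch below checks whether a host can hold the platform services, one CPU-served model, and the operating system. The 8 GiB operating-system allowance and the 16 GiB model are assumptions chosen for the example, not figures from Bud.

```python
# Upper end of the 50-60 GiB platform footprint quoted above.
PLATFORM_MEMORY_GIB = 60


def fits_in_host_memory(host_gib: float, model_gib: float, os_gib: float = 8) -> bool:
    """Return True if platform services, one model, and the OS fit in host memory.

    os_gib is an assumed allowance for the operating system, not a documented figure.
    """
    return PLATFORM_MEMORY_GIB + model_gib + os_gib <= host_gib


# Hypothetical CPU-served model that needs 16 GiB of RAM.
for host_gib in (64, 96, 128):
    print(host_gib, fits_in_host_memory(host_gib, model_gib=16))
```

With these assumptions a 64 GiB host falls short, while the 96 GiB minimum and the 128 GiB recommendation both fit.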
Smaller or quantized models can serve on CPU. A single 48–80 GB GPU
comfortably serves one model in the ~20–30B-parameter range at low latency.
Larger models or higher concurrency call for the clustered tier.
A highly available platform across three or more nodes, with a separate
inference worker pool. Suitable for production workloads up to ~1,000
concurrent users and several models served at once.
Platform nodes
1.5–3 TiB aggregate (the per-node NVMe above) for databases, queue, and working set
Analytics & observability
Separate volumes sized to your retention window (grows with traffic) — see Storage planning
Model storage (registry)
Separate volume sized to your model catalog — see Storage planning
Inference worker nodes (separate pool)

| Resource | Recommended |
| --- | --- |
| Accelerators | GPU, CPU, and/or HPU, chosen per model and SLO |
| Capacity | Sized to the models you serve; the platform recommends node type and count per model |
| Memory per GPU node | ≥ 2× the node’s total GPU memory |
| Model volume | Shared ReadWriteMany storage class (NFS, AWS EFS, Azure Files, etc.) sized to the models the pool serves; one copy for the whole pool, not per node (see Storage planning) |
| Networking | 10 Gbps between nodes |
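The host-memory rule for GPU nodes is simple arithmetic. The sketch below applies it to two hypothetical node shapes; the function name and the example GPU counts are illustrative only.

```python
def min_host_memory_gb(gpus_per_node: int, gpu_memory_gb: float) -> float:
    """Minimum host RAM for a GPU node: twice the node's total GPU memory."""
    total_gpu_memory_gb = gpus_per_node * gpu_memory_gb
    return 2 * total_gpu_memory_gb


# Hypothetical node shapes, not recommendations.
print(min_host_memory_gb(1, 48))  # one 48 GB GPU
print(min_host_memory_gb(8, 80))  # eight 80 GB GPUs
```

The result is in the same unit as the GPU memory you pass in, so a node with eight 80 GB GPUs needs at least 1,280 GB of host memory under this rule.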
High availability comes from running three copies of the platform’s
stateful services, which is the main increase over the single-node tier.
Inference nodes can be a mix of types — for example, GPU nodes for
latency-critical models alongside CPU nodes for smaller workloads.
A multi-tenant deployment for 10,000+ users. The platform layer grows
modestly; inference capacity and data retention dominate.

Platform

| Resource | Recommended |
| --- | --- |
| Nodes | ~6 |
| Total CPU | 96–128 vCPU |
| Total memory | 384–512 GiB |

Inference capacity

| Resource | Recommended |
| --- | --- |
| Inference nodes | Many, multi-node: a mix of GPU/CPU/HPU sized to your model catalog, traffic, and SLOs |

Storage

| Resource | Recommended |
| --- | --- |
| Platform + analytics | 10–20 TiB+; grows with request volume and retention |
| Model storage (registry) | Sized to your model catalog, separate from the above (see Storage planning) |
At this scale both the model registry and request-analytics/usage data
are major storage drivers. Size each independently — see Storage
planning below.
Storage falls into three independent components. They scale on completely
different axes, so size each one separately rather than picking a single total:
Independent of traffic and of your model catalog, a deployment provisions
storage for its databases, message queue, and object-store metadata. Plan for
roughly 200–500 GiB for this layer before any model files or request data.
The model registry — where downloaded model weights live — is usually the
largest and least predictable part of total storage, and it scales with
your model catalog, not with the deployment tier. Size it explicitly.
Per model, registry size ≈ parameters × bytes-per-parameter × variants kept:
| Precision | Bytes/param | 8B | 70B | 405B |
| --- | --- | --- | --- | --- |
| bf16 / fp16 | 2 | ~16 GB | ~140 GB | ~810 GB |
| fp8 / int8 | 1 | ~8 GB | ~70 GB | ~405 GB |
| int4 (quantized) | 0.5 | ~4 GB | ~35 GB | ~200 GB |
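The sizing formula can be sketched in a few lines of Python. The bytes-per-parameter values come from the table above; the function names and the example catalog are hypothetical, and real checkpoints carry some extra files (tokenizer, config), so treat the result as an estimate.

```python
# Bytes per parameter for each precision, taken from the table above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def variant_size_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in decimal GB for one stored variant."""
    # 1e9 parameters x N bytes each = N GB, so billions of parameters map directly to GB.
    return params_billion * BYTES_PER_PARAM[precision]


def catalog_size_gb(catalog: list[tuple[float, str]]) -> float:
    """Sum the size of every (params_billion, precision) variant kept in the registry."""
    return sum(variant_size_gb(params, precision) for params, precision in catalog)


# Hypothetical catalog: a 70B model kept as bf16 and int4, plus an 8B bf16 model.
catalog = [(70, "bf16"), (70, "int4"), (8, "bf16")]
print(f"{catalog_size_gb(catalog):.0f} GB")  # 140 + 35 + 16 = 191 GB
```

Each retained variant is listed as its own entry, which is how the "variants kept" factor enters the total.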
Bud often keeps more than one variant of a model (for example, the original
weights plus a quantized copy), so multiply by the number of variants you retain.
Example registry sizes. Most catalogs fall into a few tiers. Use these as a
starting point, then refine with the formula above for your exact model list:
| Registry tier | Example catalog | Approx. catalog size | Suggested registry capacity |
| --- | --- | --- | --- |
| Evaluation | One 8B chat model (bf16) + a small embedding model | ~20 GB | 50 GiB |
| Small catalog | A few 7–32B models, one of them quantized, + embeddings and a reranker | ~50–100 GB | 200 GiB |
| Production | A 70B model kept as both bf16 and int4, + 8–32B models and embeddings | ~250–500 GB | 500 GiB – 1 TiB |
| Large / multi-tenant | A quantized 405B model + several 70B/32B variants + a multimodal model + embeddings | ~1–3 TB | 2–5 TiB+ |
The suggested registry capacity column is what to provision for the registry
volume and set as MODEL_REGISTRY_MAX_SIZE; it adds headroom above the raw
catalog size for variants you add later and for the local download cache.
Model weights occupy storage in up to three places; budget for all of them:
Registry (durable copy) — one copy per model variant in the object store
(S3 / MinIO / rustfs / Ceph). This is what the registry budget,
MODEL_REGISTRY_MAX_SIZE, governs.
Local download cache — staging on the model-registry volume while a model
is fetched, before it is uploaded to the registry. Provision at least your
largest model × the number of concurrent downloads.
Inference model volume — when a model is deployed, its weights are placed
on a volume the inference pool mounts. Use a shared (ReadWriteMany) storage
class — NFS, AWS EFS, Azure Files, or similar — so the pool keeps one copy
of each model regardless of node count; size it to the models that pool serves.
(A node-local ReadWriteOnce volume works for single-node serving, but then each
node needs its own copy.) Note the trade-off: shared network-attached storage
(NFS, AWS EFS) saves space but can slow model load times at pod startup/scale-up
compared with node-local NVMe — back it with fast storage, or use a node-local
cache for latency-sensitive cold starts.
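The three locations above can be budgeted together. The following is a minimal sketch under stated assumptions: the class and function names are hypothetical, sizes are in GB, and the example numbers are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class WeightStorageBudget:
    registry_gb: float          # durable copies in the object store
    download_cache_gb: float    # staging space while models are fetched
    inference_volume_gb: float  # weights mounted by the inference pool


def plan_weight_storage(
    variant_sizes_gb: list[float],
    deployed_sizes_gb: list[float],
    concurrent_downloads: int,
    inference_nodes: int = 1,
    shared_volume: bool = True,
) -> WeightStorageBudget:
    """Estimate storage for model weights in each of the three locations."""
    # Registry: one durable copy per model variant.
    registry = sum(variant_sizes_gb)
    # Download cache: at least the largest model times the number of concurrent downloads.
    cache = max(variant_sizes_gb, default=0.0) * concurrent_downloads
    # Inference volume: one copy on shared storage, or one copy per node when node-local.
    copies = 1 if shared_volume else inference_nodes
    inference = sum(deployed_sizes_gb) * copies
    return WeightStorageBudget(registry, cache, inference)


# Hypothetical pool: three stored variants, two of them deployed across four nodes.
shared = plan_weight_storage([140, 35, 16], [35, 16], concurrent_downloads=2, inference_nodes=4)
local = plan_weight_storage([140, 35, 16], [35, 16], concurrent_downloads=2,
                            inference_nodes=4, shared_volume=False)
print(shared)
print(local)
```

In this example the shared volume needs 51 GB for deployed weights, while node-local volumes need 204 GB across the four nodes, which is the space saving that the load-time trade-off buys.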
The registry runs a pre-flight capacity check before every download, so an
over-full registry fails fast instead of part-way through a multi-gigabyte
upload. Set MODEL_REGISTRY_MAX_SIZE to your provisioned registry capacity and
grow the backing volume as your catalog grows — see
Helm Configuration.
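To illustrate the idea of a pre-flight capacity check, here is a minimal sketch. It is not Bud's implementation: the function name, its parameters, and the error type are hypothetical, and only the concept of comparing projected usage against the configured maximum comes from the text above.

```python
GIB = 2**30


def preflight_check(used_bytes: int, incoming_bytes: int, max_size_bytes: int) -> None:
    """Raise before any transfer starts if the download would exceed the registry budget."""
    projected = used_bytes + incoming_bytes
    if projected > max_size_bytes:
        raise RuntimeError(
            f"registry would hold {projected / GIB:.1f} GiB, "
            f"above the {max_size_bytes / GIB:.1f} GiB limit"
        )


# Hypothetical registry: 500 GiB budget with 400 GiB already used.
max_size = 500 * GIB
used = 400 * GIB

preflight_check(used, 35 * 10**9, max_size)  # a ~35 GB model fits, so no error

try:
    preflight_check(used, 140 * 10**9, max_size)  # a ~140 GB model does not fit
except RuntimeError as err:
    print(f"download rejected: {err}")
```

Rejecting the request up front means no bandwidth or cache space is spent on a download that cannot be stored.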
The default deployment stores per-request metadata only (not full prompt and
response text), which keeps this small. As a guide, at a sustained 1,000
requests/second:
| Configuration | Steady-state storage (3 HA copies) |
| --- | --- |
| Metadata only (default) | ~6 TB, plateaus at the retention window |
| With full request/response logging enabled | ~19 TB, plateaus at the retention window |
Scale linearly for other rates — e.g. 100 requests/second is roughly one tenth.
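The linear scaling can be written out as a small estimator. This sketch assumes the same retention window that produced the 1,000 requests/second figures above; the function and dictionary names are hypothetical.

```python
# Steady-state figures at 1,000 requests/second with 3 HA copies, from the table above.
BASELINE_RPS = 1_000
BASELINE_TB = {"metadata_only": 6.0, "full_logging": 19.0}


def analytics_storage_tb(requests_per_second: float, mode: str = "metadata_only") -> float:
    """Scale the baseline steady-state storage linearly with the sustained request rate."""
    return BASELINE_TB[mode] * requests_per_second / BASELINE_RPS


print(analytics_storage_tb(100))                  # about 0.6 TB, metadata only
print(analytics_storage_tb(250, "full_logging"))  # about 4.75 TB with full logging
```

If you shorten or lengthen the retention window, the plateau moves with it, so re-derive the baseline before relying on these numbers.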
Enabling full request/response content logging substantially increases storage.
If you turn it on, set an appropriate retention window first and provision
accordingly.
Prune unused models and stale quantized variants from the registry — for most
deployments the model catalog, not request volume, is the largest storage driver.
Tune the observability retention window to your needs (shorter = less storage).
Keep full request/response logging off unless you need it, and bound it with a
retention window when you do.
Use premium SSD/NVMe for databases and the model registry; standard SSD is
fine for general application data.
For very large datasets, scale the analytics database horizontally rather
than relying on replication alone.
Add inference nodes to increase serving capacity. Choose the accelerator —
GPU, CPU, or HPU — based on each model and its SLO; the platform recommends
the node type and configuration at deployment time.
Scale out by adding model replicas across inference nodes as concurrency grows.
Keep inference nodes in a separate node pool from the platform services so
the two scale independently, and mix node types to match each model.