# Hardware Requirements

> Plan compute, memory, storage, and inference (GPU/CPU/HPU) capacity for a self-hosted Bud deployment

## Overview

A self-hosted Bud deployment has two resource layers. Size each one separately
to avoid over- or under-provisioning your infrastructure:

<CardGroup cols={2}>
  <Card title="Platform services" icon="layer-group">
    The control plane: dashboard, APIs, gateway, databases, queue, object
    storage, and monitoring. **Memory is a largely fixed baseline; CPU scales
    with request throughput** and is the binding constraint for high-throughput
    agent/event traffic. In load tests, CPU is exhausted well before memory.
  </Card>

  <Card title="Inference nodes" icon="microchip">
    Where your models actually run, and **the dominant cost at scale.** Models
    are provisioned on dedicated inference nodes whose hardware (**GPU, CPU, or
    HPU**) is chosen per model, based on the model itself, its use case, and its
    latency/throughput (SLO) targets. The platform recommends the node type and
    configuration when you deploy each model.
  </Card>
</CardGroup>

<Note>
  Plan the platform services first (a relatively fixed footprint), then add
  inference nodes of the appropriate type — CPU, GPU, or HPU — for the models you
  intend to serve. The two layers scale independently.
</Note>

### Choosing inference hardware

You add inference capacity by attaching nodes with the accelerator that best
fits each model and its service-level objectives. The platform's optimizer
analyses the model and target SLO and recommends the node type and count.

| Hardware              | Typical fit                                                                           |
| --------------------- | ------------------------------------------------------------------------------------- |
| **GPU (CUDA)**        | Largest models, lowest latency, highest concurrency                                   |
| **CPU**               | Smaller / quantized models, batch or latency-tolerant workloads, cost-sensitive sites |
| **HPU (Intel Gaudi)** | High-throughput serving where Gaudi accelerators are available                        |

A single deployment can mix node types — for example, GPU nodes for
latency-critical models and CPU nodes for smaller or background workloads.

## Choosing a deployment size

<Note>
  The tiers below are sized by **concurrent requests** (requests in flight at once),
  not named users. Convert between concurrency and throughput with
  **concurrent ≈ requests/sec × average request latency**: for example, 1,000 req/s
  with a 10-second agent turn is \~10,000 concurrent. Holding concurrency is **cheap
  on memory** (\~1–2 MiB per in-flight request on the platform, measured on the event
  path); **CPU is driven by throughput (req/s)** and is the resource that runs out
  first for high-throughput agent/event workloads. Size CPU to your peak req/s, not
  to a user count.
</Note>
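
The conversion in the note above is Little's law, and it can be sketched in a few lines of Python. The function names are illustrative only and are not part of any Bud tooling; the 2 MiB default is the upper end of the per-request figure quoted above.

```python
def concurrent_requests(requests_per_second: float, avg_latency_seconds: float) -> float:
    """Little's law: requests in flight = arrival rate x average time in system."""
    return requests_per_second * avg_latency_seconds


def working_set_gib(concurrent: float, mib_per_request: float = 2.0) -> float:
    """Per-request platform memory, using the upper end of the ~1-2 MiB figure."""
    return concurrent * mib_per_request / 1024


in_flight = concurrent_requests(1_000, 10)
print(in_flight)                              # 10000
print(round(working_set_gib(in_flight), 1))   # 19.5 (GiB on top of the fixed baseline)
```

Even at 10,000 concurrent requests the per-request working set stays around 20 GiB, which is why the tiers treat CPU, not memory, as the resource to size against peak req/s.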

<Tabs>
  <Tab title="Single Node (AI-in-a-Box)">
    One machine running the full platform plus one or more models locally.
    Suitable for evaluations, edge sites, and small teams (up to \~100 concurrent
    requests).

    | Resource              | Minimum                               | Recommended                                               |
    | --------------------- | ------------------------------------- | --------------------------------------------------------- |
    | CPU                   | 16 cores                              | 32 cores                                                  |
    | Memory                | 96 GiB                                | 128 GiB                                                   |
    | Storage (NVMe SSD)    | 1 TiB                                 | 2 TiB                                                     |
    | Inference accelerator | CPU-only (smaller / quantized models) | 1 × 48–80 GB GPU for larger models or low-latency serving |

    <Warning>
      The platform services alone use roughly 50–60 GiB of memory, so **64 GiB is
      not enough** once a model and the operating system are added. Likewise, the
      fixed storage footprint already exceeds 200 GiB before model files — start at
      1 TiB.
    </Warning>

    Smaller or quantized models can serve on CPU. A single 48–80 GB GPU
    comfortably serves one model in the \~20–30B-parameter range at low latency.
    Larger models or higher concurrency call for the clustered tier.
  </Tab>

  <Tab title="Clustered (High Availability)">
    A highly-available platform across three or more nodes, with a separate
    inference worker pool. Suitable for production workloads up to \~1,000
    concurrent requests and several models served at once.

    **Platform nodes**

    | Resource                      | Recommended                                                                                                      |
    | ----------------------------- | ---------------------------------------------------------------------------------------------------------------- |
    | Node shape                    | 3 × (24–48 vCPU / 64–128 GiB / 500 GiB–1 TiB NVMe)                                                               |
    | Total CPU                     | 72–144 vCPU — **CPU leads** for agent/event throughput; prefer the upper end and add headroom for bursts         |
    | Total memory                  | 192–384 GiB (memory validated as ample at this tier)                                                             |
    | Platform storage (node-local) | 1.5–3 TiB aggregate (the per-node NVMe above) for databases, queue, and working set                              |
    | Analytics & observability     | Separate volumes sized to your retention window (grows with traffic) — see [Storage planning](#storage-planning) |
    | Model storage (registry)      | Separate volume sized to your model catalog — see [Storage planning](#storage-planning)                          |

    **Inference worker nodes** (separate pool)

    | Resource            | Recommended                                                                                                                                                                                          |
    | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | Accelerators        | GPU, CPU, and/or HPU — chosen per model and SLO                                                                                                                                                      |
    | Capacity            | Sized to the models you serve; the platform recommends node type and count per model                                                                                                                 |
    | Memory per GPU node | ≥ 2× the node's total GPU memory                                                                                                                                                                     |
    | Model volume        | Shared **ReadWriteMany** storage class (NFS, AWS EFS, Azure Files, etc.) sized to the models the pool serves — one copy for the whole pool, not per node (see [Storage planning](#storage-planning)) |
    | Networking          | 10 Gbps between nodes                                                                                                                                                                                |

    High availability comes from running three copies of the platform's
    stateful services, which accounts for most of the increase in resources
    over the single-node tier.
    Inference nodes can be a mix of types — for example, GPU nodes for
    latency-critical models alongside CPU nodes for smaller workloads.
  </Tab>

  <Tab title="Large-Scale (Multi-Tenant)">
    A multi-tenant deployment for 10,000+ concurrent requests. The platform
    layer's **CPU** grows with throughput (memory stays modest); inference
    capacity and data retention dominate the overall footprint.

    **Platform**

    | Resource     | Recommended                                                                                                                                              |
    | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | Nodes        | \~6–8                                                                                                                                                    |
    | Total CPU    | 128–192 vCPU — the **binding resource** at this scale; size to peak req/s (see the note under [Choosing a deployment size](#choosing-a-deployment-size)) |
    | Total memory | 384–512 GiB (rarely the limit for the platform layer)                                                                                                    |

    **Inference capacity**

    | Resource        | Recommended                                                                            |
    | --------------- | -------------------------------------------------------------------------------------- |
    | Inference nodes | Many, multi-node — a mix of GPU/CPU/HPU sized to your model catalog, traffic, and SLOs |

    **Storage**

    | Resource                 | Recommended                                                                                      |
    | ------------------------ | ------------------------------------------------------------------------------------------------ |
    | Platform + analytics     | 10–20 TiB+ — grows with request volume and retention                                             |
    | Model storage (registry) | Sized to your model catalog, separate from the above — see [Storage planning](#storage-planning) |

    At this scale **both** the model registry and request-analytics/usage data
    are major storage drivers. Size each independently — see [Storage
    planning](#storage-planning) below.
  </Tab>
</Tabs>

## Storage planning

Storage falls into **three independent components**. They scale on completely
different axes, so size each one separately rather than picking a single total:

| Component                 | Scales with                                               | Bounded?                               |
| ------------------------- | --------------------------------------------------------- | -------------------------------------- |
| Platform baseline         | Fixed footprint — databases, queue, object-store metadata | Yes, roughly fixed                     |
| Model storage             | Number and size of the models you onboard                 | No — grows with your catalog           |
| Analytics & observability | Request rate × retention window                           | Yes — plateaus at the retention window |

### Platform baseline

Independent of traffic and of your model catalog, a deployment provisions
storage for its databases, message queue, and object-store metadata. Plan for
roughly **200–500 GiB** for this layer before any model files or request data.

### Model storage

The model registry — where downloaded model weights live — is usually the
**largest and least predictable** part of total storage, and it scales with
your **model catalog, not with the deployment tier**. Size it explicitly.

Per model, registry size ≈ **parameters × bytes-per-parameter × variants kept**:

| Precision        | Bytes/param | 8B      | 70B      | 405B     |
| ---------------- | ----------- | ------- | -------- | -------- |
| bf16 / fp16      | 2           | \~16 GB | \~140 GB | \~810 GB |
| fp8 / int8       | 1           | \~8 GB  | \~70 GB  | \~405 GB |
| int4 (quantized) | 0.5         | \~4 GB  | \~35 GB  | \~200 GB |

Bud often keeps **more than one variant** of a model — for example the original
weights plus a quantized copy — so multiply by the number of variants you retain.
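
As a rough sketch of the formula above, the following Python helper sums every retained variant in a catalog. The catalog shown is a made-up example, and sizes are in decimal GB (billions of parameters × bytes per parameter), so treat the result as an estimate rather than an exact on-disk size.

```python
# Bytes per parameter, taken from the precision table above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def variant_size_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in decimal GB for one model variant."""
    return params_billions * BYTES_PER_PARAM[precision]


def catalog_size_gb(catalog) -> float:
    """catalog is a list of (params_billions, [precisions kept]) tuples."""
    return sum(
        variant_size_gb(params, precision)
        for params, precisions in catalog
        for precision in precisions
    )


# Hypothetical catalog: a 70B model kept as bf16 and int4, plus an 8B bf16 model.
catalog = [(70, ["bf16", "int4"]), (8, ["bf16"])]
print(catalog_size_gb(catalog))  # 191.0
```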

**Example registry sizes.** Most catalogs fall into a few tiers. Use these as a
starting point, then refine with the formula above for your exact model list:

| Registry tier        | Example catalog                                                                     | Approx. catalog size | Suggested registry capacity |
| -------------------- | ----------------------------------------------------------------------------------- | -------------------- | --------------------------- |
| Evaluation           | One 8B chat model (bf16) + a small embedding model                                  | \~20 GB              | 50 GiB                      |
| Small catalog        | A few 7–32B models, one of them quantized, + embeddings and a reranker              | \~50–100 GB          | 200 GiB                     |
| Production           | A 70B model kept as **both** bf16 and int4, + 8–32B models and embeddings           | \~250–500 GB         | 500 GiB – 1 TiB             |
| Large / multi-tenant | A quantized 405B model + several 70B/32B variants + a multimodal model + embeddings | \~1–3 TB             | 2–5 TiB+                    |

The **suggested registry capacity** column is what to provision for the registry
volume and set as `MODEL_REGISTRY_MAX_SIZE`; it adds headroom above the raw
catalog size for variants you add later and for the local download cache.

Model weights occupy storage in up to **three places**; budget for all of them:

* **Registry (durable copy)** — one copy per model variant in the object store
  (S3 / MinIO / rustfs / Ceph). This is what the registry budget,
  `MODEL_REGISTRY_MAX_SIZE`, governs.
* **Local download cache** — staging on the model-registry volume while a model
  is fetched, before it is uploaded to the registry. Provision at least your
  largest model × the number of concurrent downloads.
* **Inference model volume** — when a model is deployed, its weights are placed
  on a volume the inference pool mounts. Use a **shared (ReadWriteMany) storage
  class** — NFS, AWS EFS, Azure Files, or similar — so the pool keeps **one copy**
  of each model regardless of node count; size it to the models that pool serves.
  (A node-local ReadWriteOnce volume works for single-node serving, but then each
  node needs its own copy.) Note the trade-off: shared network-attached storage
  (NFS, AWS EFS) saves space but can slow model load times at pod startup/scale-up
  compared with node-local NVMe — back it with fast storage, or use a node-local
  cache for latency-sensitive cold starts.
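
The three locations above can be budgeted together. The sketch below is a planning aid under simple assumptions, not Bud's own accounting: the function and parameter names are hypothetical, sizes are in GB, and a node-local (ReadWriteOnce) volume is modeled as one full copy per node.

```python
def weight_storage_budget_gb(
    variant_sizes_gb,      # every variant kept in the registry
    deployed_sizes_gb,     # variants actually deployed to this inference pool
    concurrent_downloads,  # how many models may be fetched at once
    nodes=1,               # inference nodes in the pool
    shared_volume=True,    # True: ReadWriteMany (one copy), False: one copy per node
):
    registry = sum(variant_sizes_gb)
    # Staging space: largest model x number of concurrent downloads.
    download_cache = max(variant_sizes_gb) * concurrent_downloads
    copies = 1 if shared_volume else nodes
    inference_volume = sum(deployed_sizes_gb) * copies
    return {
        "registry": registry,
        "download_cache": download_cache,
        "inference_volume": inference_volume,
    }


# Hypothetical pool: 70B bf16 + int4 and 8B bf16 in the registry,
# with the int4 and 8B variants deployed across 4 nodes.
print(weight_storage_budget_gb([140, 35, 16], [35, 16], concurrent_downloads=2, nodes=4))
# {'registry': 191, 'download_cache': 280, 'inference_volume': 51}
```

Passing `shared_volume=False` for the same pool raises the inference volume total to 204 GB, which illustrates the space cost of per-node copies.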

<Note>
  The registry runs a **pre-flight capacity check** before every download, so an
  over-full registry fails fast instead of part-way through a multi-gigabyte
  upload. Set `MODEL_REGISTRY_MAX_SIZE` to your provisioned registry capacity and
  grow the backing volume as your catalog grows — see
  [Helm Configuration](/developer-docs/helm-configuration).
</Note>

### Analytics & observability growth

Usage data grows with traffic but is **bounded by retention windows**, so it
reaches a steady state rather than growing forever:

| Data                | What it is                                           | Default retention      |
| ------------------- | ---------------------------------------------------- | ---------------------- |
| Inference analytics | Per-request metadata: tokens, latency, model, status | 90 days                |
| Observability       | Traces, logs, and metrics across the platform        | 30 days (configurable) |
| Usage metrics       | Aggregated dashboards and billing rollups            | 90 days                |

Estimate the growth component with:

```
steady-state size ≈ bytes-per-request × requests-per-second × 86,400 × retention-days × replication
```

The default deployment stores **per-request metadata only** (not full prompt and
response text), which keeps this small. As a guide, at a sustained **1,000
requests/second**:

| Configuration                              | Steady-state (3 HA copies)                |
| ------------------------------------------ | ----------------------------------------- |
| Metadata only (default)                    | \~6 TB, plateaus at the retention window  |
| With full request/response logging enabled | \~19 TB, plateaus at the retention window |

Scale linearly for other rates — e.g. 100 requests/second is roughly one tenth.
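
The estimate above is easy to script. In this sketch the 250 bytes-per-request value is a placeholder chosen only because it lands near the metadata-only row of the table; it is not a published Bud figure, so measure your own average row size before relying on the result.

```python
SECONDS_PER_DAY = 86_400


def steady_state_bytes(bytes_per_request, requests_per_second, retention_days, replication=3):
    """Size at which analytics data plateaus once the retention window is full."""
    return (
        bytes_per_request
        * requests_per_second
        * SECONDS_PER_DAY
        * retention_days
        * replication
    )


# Assumed inputs: 250 B/request (placeholder), 1,000 req/s, 90-day retention, 3 HA copies.
size = steady_state_bytes(250, 1_000, 90)
print(f"{size / 1e12:.1f} TB")  # 5.8 TB
```

Because every term is a plain multiplier, halving the retention window or the request rate halves the steady-state size.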

<Warning>
  Enabling full request/response content logging substantially increases storage.
  If you turn it on, set an appropriate retention window first and provision
  accordingly.
</Warning>

### Keeping storage predictable

* Prune unused models and stale quantized variants from the registry — for most
  deployments the model catalog, not request volume, is the largest storage driver.
* Tune the observability retention window to your needs; a shorter window uses
  less storage.
* Keep full request/response logging off unless you need it, and bound it with a
  retention window when you do.
* Use premium SSD/NVMe for databases and the model registry; standard SSD is
  fine for general application data.
* For very large datasets, scale the analytics database **horizontally** rather
  than relying on replication alone.

## Scaling

### Platform services

* **Memory** is a largely fixed baseline (databases, queue, monitoring) plus a
  small per-request working set (\~1–2 MiB per concurrent request, measured on the
  event path). Provision it generously, but it is rarely the limiting resource.
* **CPU scales with request throughput and is usually the binding constraint** for
  high-throughput agent/event paths. Size it to peak req/s with headroom from the
  start. Every service pod also runs a **Dapr sidecar**, so per-pod CPU (and the
  total across many replicas) is higher than the app container alone.
* Stateless services support **horizontal autoscaling** (enable it per service for
  high availability and burst handling). The public event edge (budevent) holds one
  in-flight turn per concurrent request (\~1 MiB each), and each pod caps the number
  of turns it holds, so **\~10,000 concurrent requests need \~20 edge replicas**.
  Scale the edge out alongside throughput, and keep a warm `minReplicas` floor so a
  sudden spike lands on enough pods before autoscaling catches up (\~60–90 s).
* Run **three copies** of stateful services for high availability in the
  clustered and large-scale tiers. Note that the **shared databases** (the Postgres
  connection limit and the analytics store) impose a ceiling that scaling
  *stateless* replicas does not lift; scale the datastore itself for high req/s.
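
The edge-replica guidance above can be turned into a quick calculation. The per-pod cap of 500 used here is an assumption inferred from the "10,000 concurrent needs about 20 replicas" ratio, not a documented setting, so substitute the cap you observe in your own deployment.

```python
import math


def edge_replicas(peak_concurrent: int, per_pod_cap: int = 500) -> int:
    """Replicas needed so no pod exceeds its in-flight cap (cap value is assumed)."""
    return math.ceil(peak_concurrent / per_pod_cap)


print(edge_replicas(10_000))  # 20

# A warm floor sized for the concurrency a spike can reach before
# autoscaling reacts (hypothetical figure of 2,500 in-flight requests).
print(edge_replicas(2_500))   # 5
```

The second call shows one way to pick a `minReplicas` floor: size it for the concurrency you expect to absorb during the autoscaling lag rather than for the steady-state average.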

### Inference nodes

* Add **inference nodes** to increase serving capacity. Choose the accelerator —
  **GPU, CPU, or HPU** — based on each model and its SLO; the platform recommends
  the node type and configuration at deployment time.
* Scale out by adding model replicas across inference nodes as concurrency grows.
* Keep inference nodes in a **separate node pool** from the platform services so
  the two scale independently, and mix node types to match each model.

### Networking

| Traffic                 | Minimum | Recommended                             |
| ----------------------- | ------- | --------------------------------------- |
| Between nodes           | 5 Gbps  | 10–40 Gbps (higher for inference pools) |
| Internet ingress/egress | 1 Gbps  | 5 Gbps                                  |

## Next steps

<CardGroup cols={2}>
  <Card title="Installation Guide" icon="download" href="/developer-docs/installation">
    Deploy the platform on Kubernetes
  </Card>

  <Card title="Helm Configuration" icon="gear" href="/developer-docs/helm-configuration">
    Configure resources, retention, and services
  </Card>

  <Card title="Deployment" icon="rocket" href="/developer-docs/deployment">
    Deployment options and workflows
  </Card>
</CardGroup>
