# Process Document

> Run OCR / document extraction and export to multiple structured formats.

The `/v1/documents` endpoint extracts content from a document and returns it in the formats you
request. It handles **PDFs and images** (via a vision model), **Office documents** (DOCX, PPTX),
and **spreadsheets** (XLSX, XLSM, CSV). Request several formats in a single call — each is returned
under its own `*_content` field.

Large inputs are handled with bounded memory: large spreadsheets are **streamed** row‑by‑row, and
large PDFs are processed in **page‑range chunks**, so the service returns a result (or a clear
error) instead of exhausting memory. For very large or slow documents, use the [asynchronous job
path](#large-or-slow-documents).

<RequestExample>
  ```bash PDF (URL)
  curl https://gateway.bud.studio/v1/documents \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "document-processor",
      "document": {
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2408.09869"
      },
      "task": "extract",
      "options": {
        "to_formats": ["doctags", "md"]
      }
    }'
  ```

  ```bash Spreadsheet (base64 data URI)
  curl https://gateway.bud.studio/v1/documents \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "document-processor",
      "document": {
        "type": "document_url",
        "document_url": "data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,UEsDBB..."
      },
      "options": { "to_formats": ["csv", "md"] }
    }'
  ```

  ```bash Image (base64 data URI)
  curl https://gateway.bud.studio/v1/documents \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "document-processor",
      "document": {
        "type": "image_url",
        "image_url": "data:image/png;base64,iVBORw0KGgo..."
      },
      "options": { "to_formats": ["md"] }
    }'
  ```
</RequestExample>

<ResponseExample>
  ```json 200
  {
    "id": "doc_019f12e9-b26c-7f23-b352-fe572ae08f49",
    "object": "document",
    "created": 1699000000,
    "model": "document-processor",
    "document_id": "fb1b390e-9459-4c87-9e45-eb757fd1fa99",
    "pages": [
      { "page_number": 1, "markdown": "# Document title\n\nExtracted content..." }
    ],
    "usage_info": {
      "pages_processed": 9,
      "size_bytes": 5566575,
      "filename": "2408.09869"
    },
    "document": {
      "doctags_content": "<doctag>...</doctag>",
      "md_content": "# Document title\n\nExtracted content...",
      "warnings": []
    }
  }
  ```
</ResponseExample>

<Note>
  Only the formats you request in `options.to_formats` appear in the `document` object; the others
  are omitted. The example above requested `["doctags", "md"]`, so only `doctags_content` and
  `md_content` are returned.
</Note>

## Headers

| Parameter     | Type   | Required | Description                  |
| ------------- | ------ | -------- | ---------------------------- |
| Authorization | string | Yes      | Bearer authentication header |
| Content-Type  | string | Yes      | `application/json`           |

## Body

| Parameter | Type   | Required | Description                                                                                 |
| --------- | ------ | -------- | ------------------------------------------------------------------------------------------- |
| model     | string | Yes      | A document‑capable model (must be backed by the document provider — see [Models](#models)). |
| document  | object | Yes      | The document to process. See [Document input](#document-input).                             |
| task      | string | No       | Processing task. Currently `extract`. Default: `extract`.                                   |
| prompt    | string | No       | Optional override for the prompt sent to the vision model (PDF/image only).                 |
| options   | object | No       | Export and processing options. See [Options](#options).                                     |

### Document input

The `document` object selects the source by `type`:

| Field         | Type   | Description                                                         |
| ------------- | ------ | ------------------------------------------------------------------- |
| type          | string | `document_url` (PDF / Office / spreadsheet) or `image_url` (image). |
| document\_url | string | URL or base64 data URI. Required when `type` is `document_url`.     |
| image\_url    | string | URL or base64 data URI. Required when `type` is `image_url`.        |

The document format is detected from content (and, for base64, the MIME type / extension), so a
DOCX/XLSX/CSV sent as `document_url` is routed correctly without extra configuration.
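The `document` object for a local file can be assembled client-side. The following is a minimal sketch, not an official client: the helper names `document_from_bytes` and `build_request_body` are hypothetical, and choosing `image_url` for any `image/*` MIME type is an assumption based on the table above.

```python
import base64
import json
import mimetypes


def document_from_bytes(filename, data):
    """Wrap raw file bytes in a base64 data URI and pick the matching `type`."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    uri = "data:{};base64,{}".format(mime, base64.b64encode(data).decode("ascii"))
    # Assumption: images go under `image_url`; everything else under `document_url`.
    if mime.startswith("image/"):
        return {"type": "image_url", "image_url": uri}
    return {"type": "document_url", "document_url": uri}


def build_request_body(document, to_formats=("md",), model="document-processor"):
    """Serialize the JSON body described in the Body table."""
    return json.dumps({
        "model": model,
        "document": document,
        "task": "extract",
        "options": {"to_formats": list(to_formats)},
    })
```

The resulting string can be sent as the `-d` payload of the curl examples above, or as the body of any HTTP client request with the same headers.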

### Options

| Field                 | Type      | Default   | Description                                                                                                                                                                                                              |
| --------------------- | --------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| to\_formats           | string\[] | `["md"]`  | Output formats to export. Allowed: `md`, `json`, `html`, `text`, `doctags`, `csv`, `jsonl`. `markdown` is accepted as an alias for `md`. Values are case‑insensitive. `csv`/`jsonl` are populated only for spreadsheets. |
| vlm\_response\_format | string    | `doctags` | Advanced: the intermediate format the vision model is asked for (`doctags` for structured assembly, or `markdown` for raw passthrough). PDF/image only.                                                                  |

## Output formats

Each requested `to_formats` value maps to a field in the response `document` object:

| `to_formats` value   | Response field    | Contents                                                                            |
| -------------------- | ----------------- | ----------------------------------------------------------------------------------- |
| `md` (or `markdown`) | `md_content`      | Markdown export of the assembled document.                                          |
| `json`               | `json_content`    | JSON‑encoded string of the structured document (texts, tables, pictures, layout).   |
| `html`               | `html_content`    | HTML export.                                                                        |
| `text`               | `text_content`    | Plain‑text export.                                                                  |
| `doctags`            | `doctags_content` | [DocTags](https://github.com/docling-project/docling) structured markup.            |
| `csv`                | `csv_content`     | CSV export (spreadsheets only).                                                     |
| `jsonl`              | `jsonl_content`   | JSON‑Lines export, one JSON object per row keyed by the header (spreadsheets only). |

<Note>
  Structured output (real DocTags, structured HTML/JSON) requires a **DocTags‑capable** vision model
  (e.g. granite‑docling / SmolDocling) for PDFs/images. With a general vision model that returns prose,
  the service falls back to populating each requested format from the raw extracted text — so the call
  still succeeds, but `json_content`/`html_content`/`doctags_content` wrap the raw text rather than
  carrying fully structured output.
</Note>
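The mapping from `to_formats` values to response fields can be mirrored on the client when reading a response. This is a sketch under the rules stated above (case-insensitive values, `markdown` as an alias for `md`); the function names are hypothetical, and the server remains the authority on validation.

```python
ALLOWED_FORMATS = ("md", "json", "html", "text", "doctags", "csv", "jsonl")
FORMAT_ALIASES = {"markdown": "md"}


def normalize_formats(formats):
    """Lower-case, resolve the `markdown` alias, and drop duplicates."""
    normalized = []
    for value in formats:
        key = value.lower()
        key = FORMAT_ALIASES.get(key, key)
        if key not in ALLOWED_FORMATS:
            raise ValueError("unsupported format: {!r}".format(value))
        if key not in normalized:
            normalized.append(key)
    return normalized


def extract_exports(response, formats):
    """Return {format: content}; formats absent from the response map to None."""
    document = response.get("document", {})
    return {fmt: document.get(fmt + "_content") for fmt in normalize_formats(formats)}
```

A format that maps to `None` was either not requested or not populated (for example `csv` on a PDF input).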

## Spreadsheets

XLSX, XLSM, and CSV inputs are extracted directly (no vision model needed). Small spreadsheets are
parsed with full structure; **large spreadsheets are streamed** with bounded memory and are best
requested as `csv`, `jsonl`, `text`, or `md`. Streaming necessarily loses some information; each
loss is reported in `document.warnings` (a list of strings), for example:

* merged cells are flattened (only the top‑left cell of a merged range keeps its value);
* embedded images are dropped (cell data only);
* formula cells emit the formula text when cached values are unavailable.

Very wide or very large sheets are truncated to a bounded size. When the streamed output would
exceed the inline limit, the request returns `413`; retry with fewer formats (e.g. `csv` only) or
submit a narrower sheet.
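A spreadsheet response requested as `jsonl` can be turned into rows while keeping the streaming warnings visible. The sketch below assumes one JSON object per line, as described in the output formats table; `read_spreadsheet` is a hypothetical helper name.

```python
import json


def rows_from_jsonl(jsonl_content):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    # Split on "\n" only: str.splitlines() would also split on characters
    # such as U+2028 that may legally appear inside JSON strings.
    return [json.loads(line) for line in jsonl_content.split("\n") if line.strip()]


def read_spreadsheet(response):
    """Return (rows, warnings) from a response that requested `jsonl`."""
    document = response.get("document", {})
    rows = rows_from_jsonl(document.get("jsonl_content") or "")
    warnings = document.get("warnings", [])
    return rows, warnings
```

Checking `warnings` before trusting the rows is worthwhile, since merged cells or uncached formulas change what each value means.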

## Response

| Field        | Type      | Description                                                                                                                                                                                           |
| ------------ | --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| id           | string    | Response id, prefixed `doc_`.                                                                                                                                                                         |
| object       | string    | Always `document`.                                                                                                                                                                                    |
| created      | integer   | Unix timestamp (seconds).                                                                                                                                                                             |
| model        | string    | The model used.                                                                                                                                                                                       |
| document\_id | string    | Unique id for the processed document.                                                                                                                                                                 |
| pages        | object\[] | Per‑page results: `{ page_number, markdown }` (raw per‑page model output).                                                                                                                            |
| usage\_info  | object    | `{ pages_processed, size_bytes, filename }`.                                                                                                                                                          |
| document     | object    | Per‑format exports (`md_content`, `json_content`, `html_content`, `text_content`, `doctags_content`, `csv_content`, `jsonl_content`) plus `warnings` (string\[]). Only requested formats are present. |

## Size limits & large documents

* **base64 data URIs are capped** (default 25 MB decoded). Larger payloads return `413`; instead,
  pass a remote URL in `document_url` (fetched and streamed server‑side, up to the overall size
  limit, default 100 MB) or use the [multipart upload](#large-or-slow-documents).
* **Large spreadsheets** stream automatically with bounded memory.
* **Large PDFs** are converted in page‑range chunks so memory stays bounded regardless of page
  count. Because each page still requires a vision‑model call, a big PDF can take minutes — for
  those, prefer the asynchronous job path below so the request does not hold the connection open.
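These limits suggest a simple client-side decision before sending a file. The sketch below uses the documented defaults, read as decimal megabytes (an assumption; a deployment may configure other values), and further assumes the overall size limit applies to every submission path. The function and return labels are hypothetical.

```python
# Documented defaults; treat as assumptions for any given deployment.
BASE64_CAP_BYTES = 25 * 1000 * 1000
OVERALL_CAP_BYTES = 100 * 1000 * 1000


def choose_submission(size_bytes, has_remote_url=False):
    """Pick how to send a document of `size_bytes` (decoded size)."""
    if size_bytes > OVERALL_CAP_BYTES:
        return "too_large"          # expected to be rejected with 413
    if size_bytes <= BASE64_CAP_BYTES:
        return "inline_base64"      # data URI in the JSON body
    if has_remote_url:
        return "document_url"       # server fetches and streams the file
    return "multipart_upload"       # streamed upload on the document service
```

For example, `choose_submission(60_000_000)` returns `"multipart_upload"` because the file exceeds the base64 cap and no remote URL is available.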

## Errors

The endpoint propagates the HTTP status code of the underlying failure:

| Status | Meaning                                                                                                                                                                      |
| ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 400    | Bad request — the document could not be fetched or parsed, or the vision model produced no content.                                                                          |
| 401    | Missing or invalid API key.                                                                                                                                                  |
| 404    | The requested `model` is unknown or not document‑capable.                                                                                                                    |
| 413    | Payload too large — base64 over the cap, file over the size limit, PDF over the page limit, or a JSON / streamed‑spreadsheet export that would exceed the inline size limit. |
| 422    | Unprocessable — an unsupported format (e.g. legacy `.xls`), a corrupt/truncated file, a document that exceeds the content budget, or processing that **timed out**.          |

```json 400
{
  "error": {
    "message": "BudDoc returned error status 400 Bad Request: {\"detail\":\"Failed to fetch document from URL: ...\"}"
  }
}
```

```json 413
{
  "error": {
    "message": "BudDoc returned error status 413 Payload Too Large: {\"detail\":\"base64 payload exceeds the 25 MB limit for inline data URIs; upload larger files via POST /documents/ocr/upload (multipart) or pass a document_url\"}"
  }
}
```
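In both examples the `message` string embeds the upstream status and a JSON object with a `detail` field. A client could recover those as sketched below. The message layout is inferred only from the two examples above and is not a documented contract, so the hypothetical `parse_gateway_error` falls back to the raw message when the pattern does not match.

```python
import json
import re

# Matches "... error status 413 Payload Too Large: {...}" as seen in the examples.
_ERROR_PATTERN = re.compile(r"error status (\d{3})[^:]*: (\{.*\})\s*$", re.DOTALL)


def parse_gateway_error(body):
    """Return (upstream_status or None, human-readable detail)."""
    message = body.get("error", {}).get("message", "")
    match = _ERROR_PATTERN.search(message)
    if not match:
        return None, message
    status = int(match.group(1))
    try:
        detail = json.loads(match.group(2)).get("detail", message)
    except json.JSONDecodeError:
        detail = message
    return status, detail
```

The HTTP status code of the response itself remains the reliable signal; parsing the message is only useful for showing a cleaner error to users.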

## Large or slow documents

For files too large for base64, or documents that take too long to process synchronously (e.g.
multi‑hundred‑page PDFs), the document service exposes an **asynchronous job path** and a
**streamed multipart upload**. A job is processed in the background with bounded memory (only a
few pages or rows are rendered at a time), and you poll for the result.

<Note>
  These endpoints are served by the document service. They are not currently proxied through the
  public `/v1/documents` gateway route — call them on the document service base URL.
</Note>

### Submit a job

* `POST /documents/ocr/jobs` — same JSON body as `/v1/documents` (base64 or `document_url`).
* `POST /documents/ocr/jobs/upload` — `multipart/form-data` (`file`, `model`, `to_formats`,
  `prompt`), which streams the upload to disk and never buffers it in memory.

Both return `202 Accepted`:

```json 202
{ "job_id": "44a9d3e4-6b28-4902-941e-81fba96aada0", "status": "pending", "poll_url": "/documents/ocr/jobs/44a9d3e4-6b28-4902-941e-81fba96aada0" }
```

```bash multipart submit
curl -X POST https://<document-service>/documents/ocr/jobs/upload \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@report.pdf" \
  -F "model=document-processor" \
  -F "to_formats=md"
```

Intake validation (size caps, SSRF checks, format sniffing) runs synchronously at submit time, so an
oversized or unsupported input is rejected immediately (`413`/`422`) and is never enqueued.

### Poll for the result

`GET /documents/ocr/jobs/{job_id}`:

* `200` with `status` = `pending` / `running` while processing (no `result` yet).
* `200` with `status` = `completed` and the `result` (same shape as the synchronous response).
* If the job **failed**, the poll returns the failure's real HTTP status (e.g. `413`, `422`, `500`).
* `404` if the `job_id` is unknown or expired.
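The polling rules above can be sketched as a small loop. This is an illustration, not an official client: `fetch_status` is a hypothetical callable you supply that performs `GET /documents/ocr/jobs/{job_id}` and returns `(http_status, json_body)`, and the interval and timeout defaults are arbitrary.

```python
import time


def poll_job(fetch_status, job_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until the job completes; raise on failure, unknown job, or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status_code, body = fetch_status(job_id)
        if status_code != 200:
            # A failed job surfaces its real status (e.g. 413, 422, 500);
            # 404 means the job id is unknown or expired.
            raise RuntimeError("job {} returned HTTP {}".format(job_id, status_code))
        state = body.get("status")
        if state == "completed":
            return body["result"]
        if state not in ("pending", "running"):
            raise RuntimeError("unexpected job status: {!r}".format(state))
        if time.monotonic() >= deadline:
            raise TimeoutError("job {} did not finish in time".format(job_id))
        sleep(interval)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP library and makes it easy to exercise without a live service.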

`DELETE /documents/ocr/jobs/{job_id}` removes a job record (and any spilled result). There is also a
synchronous `POST /documents/ocr/upload` (multipart) for a streamed upload that returns the result
directly, subject to the same processing‑time considerations as `/v1/documents`.

## Models

`/v1/documents` is served by the document provider (powered by [docling](https://github.com/docling-project/docling)
and a vision model), not by chat providers. The `model` you pass must be registered as
**document‑capable** and routed to the document backend. A vision/multimodal model (image input +
text output) is required for PDFs and images; a **DocTags model** (granite‑docling / SmolDocling) is
recommended for genuinely structured `doctags`/`html`/`json` output. Spreadsheets and CSVs are
extracted without a vision model.
