# Run Agent Manager on a VM with Docker

Not recommended for production use

Both installation paths on this page, **Simple** and **Advanced**, are intended for **evaluation, demos, and proof-of-concept** use only. Do not use them to run production workloads or to handle sensitive or regulated data.

For production, run Agent Manager on a properly operated Kubernetes platform with high availability, a managed and backed-up database, secret management, monitoring, and a hardened, redundant ingress following your organization's production practices. Use these installers to try Agent Manager out, not to run it for real.

Install Agent Manager on a Linux VM where Docker is the only host dependency. Pick the path that fits you:

* **Simple** — give the installer the VM's public IP and it does everything else: hostnames are derived from the IP via [sslip.io](https://sslip.io) and TLS certificates are issued automatically by Let's Encrypt. No domain, no DNS setup, no certificate handling. Best for demos and quick evaluations.
* **Advanced** — a config-file-driven installer for custom domains and operator-managed TLS: use your own **custom domain**, bring **your own certificates**, or front the VM with a **load balancer** that terminates TLS. Adds pre-flight validation of your config, certificates, and DNS.

The simple installer exposes the platform over HTTPS using [sslip.io](https://sslip.io) hostnames derived from the VM's public IP, so you need neither a registered domain nor client-side `/etc/hosts` edits.

## Prerequisites

On your local machine you need only an SSH client to log into the VM; everything else runs on the VM itself.

* **Git** to clone this repository and fetch the installer. This is the one tool you need *before* running the script (the script can't install what you use to download it). Most images have it; on a minimal one install it with `sudo apt-get update && sudo apt-get install -y git` (Debian/Ubuntu).
* **Docker** is required — the whole stack runs on it (k3d runs the Kubernetes cluster as Docker containers, and Caddy runs as a container). If Docker isn't already installed, the script installs it for you, along with k3d, kubectl, helm, and lsof.
* A Linux VM with a **static (reserved) public IP** and SSH access (sudo). The install derives every hostname, TLS certificate, and OAuth issuer from the IP (`*.amp.<IP>.sslip.io`), so a **changing IP breaks the install** — and stopping the VM (for example to resize its disk) releases an ephemeral IP. Reserve the address before installing. If the IP ever changes, reinstall against the new IP.
* **At least 50 GB of disk.** Building and running agents pushes the in-cluster image store past 13 GB; on a smaller disk the node hits `DiskPressure`, which evicts pods and can take cluster DNS down mid-build.
* **At least 4 vCPUs and 8 GB of RAM** to run the full k3d + OpenChoreo + Agent Manager stack comfortably.
* **Inbound `443/tcp` open** in the cloud security group / firewall — and only 443. Certificates issue via the TLS-ALPN-01 ACME challenge, which runs inside the `:443` TLS handshake, so no inbound port 80 is ever needed. The `:443` exposure must be **TCP passthrough** (not a TLS-terminating load balancer in front), since the challenge happens inside the handshake.
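
The sizing minimums above can be checked on the VM before you start. The following is a minimal sketch and not part of the installer: the function names `check_min` and `preflight` are hypothetical, RAM and disk are rounded up to whole GiB, and the 45 GiB disk floor is an assumption that allows for a 50 GB disk reporting a slightly smaller filesystem.

```shell
# check_min LABEL ACTUAL MINIMUM
# Print a pass/fail line and return 1 when ACTUAL is below MINIMUM.
check_min() {
  if [ "$2" -ge "$3" ]; then
    echo "ok   $1: $2 (minimum $3)"
  else
    echo "LOW  $1: $2 (minimum $3)"
    return 1
  fi
}

# preflight
# Gather this VM's vCPU count, RAM, and root filesystem size, then compare
# them with the documented minimums (disk floor is an assumed 45 GiB).
preflight() {
  cpus=$(nproc)
  ram_gib=$(awk '/^MemTotal:/ {print int(($2 + 1048575) / 1048576)}' /proc/meminfo)
  disk_gib=$(df -Pk / | awk 'NR==2 {print int(($2 + 1048575) / 1048576)}')
  rc=0
  check_min "vCPUs" "$cpus" 4 || rc=1
  check_min "RAM (GiB)" "$ram_gib" 8 || rc=1
  check_min "root disk (GiB)" "$disk_gib" 45 || rc=1
  return "$rc"
}

# Usage on the VM:
#   preflight || echo "VM is below the recommended size"
```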

## Install

SSH into the VM, get the installer, and run it with `sudo`:

```
# on the VM
git clone https://github.com/wso2/agent-manager.git
cd agent-manager/deployments/vm
git checkout tags/amp/v0.16.0

sudo ./install-vm.sh \
  --host <VM_PUBLIC_IP> \
  --version 0.16.0 \
  --email you@example.com
```

Pass the VM's **public** IPv4 address to `--host`. A cloud VM usually can't discover its own public IP, because its network interface holds a private address that the provider NATs to the public one you used to SSH in. The installer therefore needs you to supply the address so it can build the `*.amp.<IP>.sslip.io` hostnames.
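
An easy mistake is to pass the private address shown by the VM's network interface. As a rough guard, the sketch below rejects malformed addresses and the common private, loopback, link-local, and carrier-grade NAT ranges. The function name `is_public_ipv4` is hypothetical and this check is not part of the installer.

```shell
# is_public_ipv4 ADDRESS
# Return 0 when ADDRESS is a well-formed IPv4 address outside the common
# non-public ranges (10/8, 127/8, 169.254/16, 172.16/12, 192.168/16, 100.64/10).
is_public_ipv4() {
  ip=$1
  # Only digits and dots, no empty octets.
  case $ip in
    *[!0-9.]* | '' | .* | *. | *..*) return 1 ;;
  esac
  # Split on dots into positional parameters.
  old_ifs=$IFS
  IFS=.
  set -- $ip
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  case $1.$2 in
    10.* | 127.* | 192.168 | 169.254) return 1 ;;
  esac
  [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ] && return 1
  [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ] && return 1
  return 0
}

# Usage:
#   is_public_ipv4 "$VM_PUBLIC_IP" || echo "that does not look like a public IPv4 address"
```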

The installer runs in two phases: bootstrap (Docker, tools, and firewall), then the platform install and Caddy startup. Allow 8–15 minutes. It needs `sudo` because it installs Docker, opens the firewall, and creates the cluster.

### Options

| Flag                     | Default      | Purpose                                                                            |
| ------------------------ | ------------ | ---------------------------------------------------------------------------------- |
| `--host`                 | *(required)* | The VM's public IPv4 address                                                       |
| `--version`              | *(required)* | Agent Manager release to install — use the same `amp/v*` tag you checked out above |
| `--email`                | *(none)*     | ACME contact for expiry notices                                                    |
| `--no-external-gateways` | off          | Drop the gateway control-plane endpoint if you won't connect external gateways     |

## What gets exposed

The installer fronts the stack with [Caddy](https://caddyserver.com), an open-source web server that terminates TLS, obtains and renews Let's Encrypt certificates automatically, and reverse-proxies each public hostname to the right service. It runs as a single `amp-caddy` Docker container and is the only process listening on the internet-facing ports.

Only `:443` faces the internet; all other service ports are bound to the VM's loopback and reached only by Caddy.

Every public hostname resolves to the VM's IP (via sslip.io) and arrives at Caddy on `:443`; Caddy terminates TLS and reverse-proxies to the matching loopback port. Certificates are obtained over that same `:443` using the TLS-ALPN-01 challenge, so no inbound port 80 is needed. The deployed-agent wildcard gets its certificate on demand at first request.

| URL                                                | Purpose                                                                 |
| -------------------------------------------------- | ----------------------------------------------------------------------- |
| `https://console.amp.<IP>.sslip.io`                | Console UI                                                              |
| `https://api.amp.<IP>.sslip.io`                    | Agent Manager API (used by `amctl`)                                     |
| `https://thunder.amp.<IP>.sslip.io`                | Thunder OAuth (login)                                                   |
| `https://observer.amp.<IP>.sslip.io`               | Traces Observer                                                         |
| `https://gateway.amp.<IP>.sslip.io/otel`           | OTel trace ingest from deployed agents                                  |
| `https://<org>-<project>.agents.<IP>.sslip.io/...` | Deployed-agent invocation endpoints (one wildcard host per org/project) |
| `https://cp.amp.<IP>.sslip.io`                     | Gateway control plane — connect external gateways here (on by default)  |
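
To see how the table's hostnames follow from the IP, the sketch below prints the fixed service URLs and one deployed-agent URL for a given address. The function names `amp_urls` and `agent_url` are hypothetical helpers for illustration; the installer prints the real URLs when it finishes.

```shell
# amp_urls IP
# Print the base URL of each fixed service host derived from the VM's IP.
amp_urls() {
  for svc in console api thunder observer gateway cp; do
    echo "https://$svc.amp.$1.sslip.io"
  done
}

# agent_url ORG PROJECT IP
# Print the base URL of the deployed-agent host for one org/project pair.
agent_url() {
  echo "https://$1-$2.agents.$3.sslip.io"
}

# Usage (203.0.113.10 is a documentation-only example address):
#   amp_urls 203.0.113.10
#   agent_url acme demo 203.0.113.10
```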

## Log in

Open `https://console.amp.<IP>.sslip.io` and sign in as the seeded Agent Manager admin user **`amp-admin`** (password **`amp-admin`**). This user holds the AMP `admin` role, which grants every application permission.

From the `0.16.0` release, role-based access control is enforced on the API (`rbacEnabled`), so the token must carry the right scopes. Note that Thunder's own system account (`admin` / `admin`, shown in the bootstrap logs) is **not** granted the Agent Manager application role — signing in with it lets you reach the console but every API call fails with `403 insufficient permissions`. Always use `amp-admin`.

## Deployed-agent invocation

When you deploy an agent, its endpoint is published on a per-project host `<org>-<project>.agents.<IP>.sslip.io` and routed by Caddy to the OpenChoreo data-plane gateway. Because these hostnames are dynamic (a new one per org/project), Caddy issues their TLS certificates **on demand** at the first request (via the same ACME challenge as the fixed hosts), rather than up front. Invocations are authenticated with a user token that the gateway validates against the public Thunder issuer.

Because issuance is on demand and uses TLS-ALPN-01 (the challenge runs inside the `:443` handshake), the **very first request to a newly-deployed agent host can fail with a one-time certificate error** — most visibly `ERR_CERTIFICATE_TRANSPARENCY_REQUIRED` in Chrome. That first connection is consumed by Caddy answering the ACME challenge, so the browser briefly sees the challenge certificate instead of the real one. Issuance completes within a second or two; reload the page (or open it in a fresh tab) and it serves the trusted Let's Encrypt certificate. This only affects the first hit per new agent host — the certificate is then cached in the `amp-caddy-data` volume.
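
Scripted callers can absorb that one-time failure by retrying after a short pause. The sketch below is a generic retry helper, assuming the caller is a shell script that uses `curl`; the name `retry` is hypothetical and nothing like it ships with Agent Manager.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between tries. Returns 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage: tolerate the first-request certificate error on a new agent host.
# curl exits non-zero on TLS failures, so the second attempt normally succeeds.
#   retry 3 2 curl -sS -o /dev/null "https://<org>-<project>.agents.<IP>.sslip.io/..."
```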

The `amp-api` service advertises each agent endpoint with the `https://` scheme (the installer sets `tlsEnabled` on the service), so the console, and any other caller, invokes it over TLS directly through the wildcard site.

## TLS

Caddy obtains and auto-renews trusted Let's Encrypt certificates on first start — no manual certificate steps. Issuance uses the **TLS-ALPN-01** challenge, which runs inside the `:443` TLS handshake, so only inbound 443 is ever required and there is no port-80 dependency. Certificates and the ACME account persist in the `amp-caddy-data` Docker volume, so restarts do not re-request them.

Because the challenge happens inside the TLS handshake, the public `:443` must reach Caddy as **raw TCP** — do not put a TLS-terminating load balancer in front of the VM. There is no `:80` listener, so plain `http://` URLs are not served (no automatic http→https redirect); always use the `https://` URLs the installer prints.
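
To confirm from another machine which certificate a host is serving, you can inspect it with the `openssl` CLI. This is a minimal sketch using standard `openssl s_client` and `openssl x509` options; the function names `fetch_cert` and `cert_summary` are hypothetical.

```shell
# fetch_cert HOST
# Open a TLS connection to HOST:443 (with SNI) and print the handshake output,
# which includes the server certificate in PEM form.
fetch_cert() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null
}

# cert_summary
# Read a PEM certificate on stdin and print its issuer and expiry date.
cert_summary() {
  openssl x509 -noout -issuer -enddate
}

# Usage:
#   fetch_cert "console.amp.<IP>.sslip.io" | cert_summary
```

A Let's Encrypt issuer in the output indicates that issuance succeeded; any other issuer suggests something in front of the VM is terminating TLS.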

## Persistence and teardown

Application data (PostgreSQL), issued certificates, and the k3d cluster persist across Docker/host restarts via named volumes. To tear down completely, delete the cluster, then remove the Caddy front door and its volumes (which hold the issued certificates and ACME account):

```
cd agent-manager/deployments/quick-start
sudo ./uninstall.sh --delete-cluster                    # delete the k3d cluster (workloads + app data)
sudo docker rm -f amp-caddy                              # remove the Caddy front door
sudo docker volume rm amp-caddy-data amp-caddy-config    # drop the cached certs + ACME account
```

Use `sudo` — the installer runs Docker and k3d as root. Plain `./uninstall.sh` (without `--delete-cluster`) only removes the Helm releases and leaves the cluster running; `uninstall.sh` does not touch the Caddy container or its volumes, so remove those separately as shown.

## Connect an external gateway

Agent Manager can drive external WSO2 AI gateways. The control-plane endpoint `https://cp.amp.<IP>.sslip.io` is exposed by default for this. In the console, open **Infrastructure → Gateways**, generate a registration token, and follow the generated commands — they point the gateway at `cp.amp.<IP>.sslip.io:443`, where it opens a control WebSocket and pulls its configuration. If you do not need external gateways, install with `--no-external-gateways` to drop this endpoint.

**Security:** the registration token grants a gateway your LLM-provider API keys and proxy credentials. Treat it as a secret, revoke/regenerate it from the Gateways page when a gateway is decommissioned, and optionally restrict `cp.amp...` to known gateway source IPs at the firewall.

## Troubleshooting

* **Certificates never issue / hosts unreachable from outside** — open inbound `:443` in your cloud security group / NACL, and make sure the public `:443` reaches the VM as **raw TCP**: a TLS-terminating load balancer in front breaks the TLS-ALPN-01 challenge. The installer can't verify external reachability from inside the VM, so this surfaces as Caddy failing to obtain certificates (`docker logs amp-caddy`).
* **Certificate not issued** — check `docker logs amp-caddy`. Let's Encrypt rate limits on sslip.io are high but not infinite; if hit, retry shortly.
* **Login redirect mismatch** — confirm you reached the console via its `console.amp.<IP>.sslip.io` URL, not the raw IP.
* **`403 insufficient permissions` on API calls** — you are signed in as Thunder's system `admin` account, which has no Agent Manager application role. Sign out and sign back in as `amp-admin` (see [Log in](#log-in)).
* **Certificate error on first agent invocation** (`ERR_CERTIFICATE_TRANSPARENCY_REQUIRED` or similar) — the per-agent certificate is issued on demand, and the first request races with that issuance. Reload the page after a second or two; it only happens once per new agent host (see [Deployed-agent invocation](#deployed-agent-invocation)).

The advanced installer (`install-advanced.sh`) is for deployments that need a real domain or operator-managed certificates. It is driven by a config file and supports three TLS modes. Like the simple installer, it runs **on the VM** with `sudo`.

Use the advanced installer when you want any of:

* a **custom domain** (e.g. `console.amp.mycompany.com`) instead of an IP-derived `sslip.io` name;
* **bring-your-own certificates** (BYOC) issued by a corporate CA or pulled from a secrets store, rather than Let's Encrypt;
* **TLS terminated upstream** by a cloud load balancer or corporate proxy that already owns the public certificate.

If none of those apply, prefer the **Simple** tab.

## Prerequisites {#adv-prerequisites}

The compute, disk, and tooling prerequisites are the same as the Simple tab:

* **Git** to clone this repository and fetch the installer. This is the one tool you need *before* running the script (the script can't install what you use to download it); on a minimal image install it with `sudo apt-get update && sudo apt-get install -y git` (Debian/Ubuntu).
* **Docker** — the whole stack runs on it. If it isn't already installed, the script installs it for you, along with k3d, kubectl, helm, lsof, and openssl.
* A **Linux VM** with at least **4 vCPUs**, **8 GB RAM**, and **50 GB of disk**, with SSH access (sudo).

In addition, the advanced installer needs:

* **Control of your own DNS** for the chosen domain. The installer derives all service hostnames from a single base domain (`DOMAIN_BASE`), so you create DNS records under that domain (see [DNS](#adv-dns)).
* **The right inbound port open**, depending on the TLS mode (see [TLS modes](#adv-tls-modes)): `443` for `letsencrypt` and `byoc`, or your chosen forward port for `upstream`.

## Configure {#adv-configure}

Generate an annotated config template and edit it:

```
# on the VM
git clone https://github.com/wso2/agent-manager.git
cd agent-manager/deployments/vm
git checkout tags/amp/v0.16.0

./install-advanced.sh --init > amp-config.env
# edit amp-config.env
```

The config file is plain shell (sourced by the installer). The keys are:

| Key                                                                                    | Required    | Purpose                                                                                                                                                       |
| -------------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `AMP_VERSION`                                                                          | yes         | Agent Manager release to install — use the same `amp/v*` [tag](https://github.com/wso2/agent-manager/tags) you checked out above (`0.16.0`)                   |
| `DOMAIN_BASE`                                                                          | yes         | Base domain; service hosts are derived as `<svc>.<DOMAIN_BASE>`                                                                                               |
| `TLS_MODE`                                                                             | yes         | `letsencrypt`, `byoc`, or `upstream`                                                                                                                          |
| `ACME_EMAIL`                                                                           | letsencrypt | ACME contact for expiry notices                                                                                                                               |
| `TLS_CERT_FILE` / `TLS_KEY_FILE`                                                       | byoc        | Paths to the operator certificate and private key                                                                                                             |
| `UPSTREAM_LISTEN_PORT`                                                                 | upstream    | Plain-HTTP port Caddy listens on behind the LB (default `80`). Must not be a loopback-bound cluster port (3000/8080/9000/9098/9243/19080/22893); `80` is safe |
| `UPSTREAM_TRUSTED_PROXIES`                                                             | upstream    | Space-separated CIDRs of the LB whose `X-Forwarded-*` headers Caddy trusts (default `0.0.0.0/0`)                                                              |
| `EXTERNAL_GATEWAYS`                                                                    | no          | `true` (default) exposes the `cp` endpoint for external data-plane gateways                                                                                   |
| `HOST_CONSOLE`, `HOST_API`, `HOST_THUNDER`, `HOST_OBSERVER`, `HOST_GATEWAY`, `HOST_CP` | no          | Override an individual service hostname (default `<svc>.<DOMAIN_BASE>`)                                                                                       |
| `AGENTS_BASE`                                                                          | no          | Base for deployed-agent hostnames (default `agents.<DOMAIN_BASE>`)                                                                                            |

With `DOMAIN_BASE=amp.mycompany.com`, the derived hosts are `console.amp.mycompany.com`, `api.amp.mycompany.com`, `thunder.amp.mycompany.com`, `observer.amp.mycompany.com`, `gateway.amp.mycompany.com`, `cp.amp.mycompany.com`, and deployed agents at `<org>-<project>.agents.amp.mycompany.com`.
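
The derivation rule (use the `HOST_*` override if set, otherwise `<svc>.<DOMAIN_BASE>`) can be sketched in the same plain-shell form the config file uses. This illustrates the rule as described above and is not the installer's own code; `host_for` and `agents_base` are hypothetical helper names, and the `HOST_API` value is an invented example override.

```shell
# Example config values (illustrative only).
DOMAIN_BASE=amp.mycompany.com
HOST_API=amp-api.mycompany.com   # hypothetical override for one service

# host_for SERVICE
# Print the hostname for console, api, thunder, observer, gateway, or cp:
# the matching HOST_<SERVICE> override if set, else <service>.<DOMAIN_BASE>.
host_for() {
  var=HOST_$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  eval "override=\${$var:-}"
  echo "${override:-$1.$DOMAIN_BASE}"
}

# agents_base
# Print the base under which deployed-agent hostnames are created.
agents_base() {
  echo "${AGENTS_BASE:-agents.$DOMAIN_BASE}"
}

# Usage:
#   host_for console   # console.amp.mycompany.com
#   host_for api       # amp-api.mycompany.com (override wins)
#   agents_base        # agents.amp.mycompany.com
```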

## TLS modes {#adv-tls-modes}

In every mode, the URLs published to browsers and clients use `https://`. The modes differ only in where and how TLS is terminated.

| Mode          | How TLS is handled                                                                                                                | Inbound port to open                               | When to use                                                             |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- | ----------------------------------------------------------------------- |
| `letsencrypt` | Caddy obtains and renews trusted Let's Encrypt certificates automatically (TLS-ALPN-01, inside the `:443` handshake)              | `443` (raw TCP, no proxy in front)                 | You control DNS for the domain and want automatic certificates          |
| `byoc`        | Caddy serves your supplied certificate and key on `:443`; no ACME                                                                 | `443`                                              | Certificates come from a corporate CA or a secrets store                |
| `upstream`    | A cloud load balancer / proxy in front terminates TLS; Caddy listens plain-HTTP on `UPSTREAM_LISTEN_PORT` and only routes by Host | the LB's forward port (the LB owns `443` publicly) | You already run an edge load balancer that holds the public certificate |

### BYOC certificate requirements {#adv-byoc-san}

Deployed-agent endpoints live one DNS level deeper than the service hosts, at `<org>-<project>.<AGENTS_BASE>`. A standard `*.<DOMAIN_BASE>` wildcard does **not** cover that tier, and there is no ACME in `byoc` mode to issue per-host certificates on demand. Your single certificate must therefore carry SANs covering **both** `*.<DOMAIN_BASE>` and `*.<AGENTS_BASE>`. The installer's pre-flight checks this and fails fast, naming the missing SAN, if either is absent. It also verifies that the certificate and key match and that the certificate has not expired.

For example, this command creates a self-signed test certificate covering both tiers (for a real deployment, request the same SANs from your CA; `-addext` requires OpenSSL 1.1.1 or later):

```
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout privkey.pem -out fullchain.pem -subj "/CN=amp.mycompany.com" \
  -addext "subjectAltName=DNS:*.amp.mycompany.com,DNS:*.agents.amp.mycompany.com"
```
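
Before running the installer you can perform similar checks by hand. The sketch below approximates the three pre-flight checks described above (SAN coverage, certificate/key match, expiry) with standard `openssl` subcommands. It is an assumption-laden illustration, not the installer's implementation: the function name `check_byoc_cert` is hypothetical, and `-ext` requires OpenSSL 1.1.1 or later.

```shell
# check_byoc_cert CERT_FILE KEY_FILE DOMAIN_BASE AGENTS_BASE
# Return 0 only if the certificate carries both wildcard SANs, matches the
# private key, and has not expired. Prints the first problem found.
check_byoc_cert() {
  cert=$1 key=$2 domain_base=$3 agents_base=$4

  # Split the SAN extension into one entry per line, e.g. "DNS:*.amp.mycompany.com".
  sans=$(openssl x509 -in "$cert" -noout -ext subjectAltName |
    tr ',' '\n' | sed 's/^[[:space:]]*//')
  for want in "*.$domain_base" "*.$agents_base"; do
    printf '%s\n' "$sans" | grep -qxF "DNS:$want" || {
      echo "missing SAN: $want"
      return 1
    }
  done

  # The cert and key match when they yield the same public key.
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey | openssl sha256)
  key_pub=$(openssl pkey -in "$key" -pubout | openssl sha256)
  [ "$cert_pub" = "$key_pub" ] || {
    echo "certificate and key do not match"
    return 1
  }

  # -checkend 0 fails if the certificate is already expired.
  openssl x509 -in "$cert" -noout -checkend 0 >/dev/null || {
    echo "certificate has expired"
    return 1
  }
  echo "certificate OK"
}

# Usage:
#   check_byoc_cert fullchain.pem privkey.pem amp.mycompany.com agents.amp.mycompany.com
```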

### Upstream (load-balancer) topology {#adv-upstream}

In `upstream` mode the load balancer owns `:443` and the public certificate. Configure it to forward each derived hostname to the VM's `UPSTREAM_LISTEN_PORT` over plain HTTP and to set the `X-Forwarded-Proto: https` header. Caddy trusts that header, so the backends still see the original `https` scheme. Because public DNS points at the load balancer rather than the VM, the installer's DNS check is advisory (not a hard failure) in this mode.

Because the listen port carries plain HTTP and Caddy trusts the forwarded scheme, lock down who can reach it: **restrict `UPSTREAM_LISTEN_PORT` to the load balancer at the firewall**, and set `UPSTREAM_TRUSTED_PROXIES` to the LB's source CIDRs so only the LB can set `X-Forwarded-*`. The default (`0.0.0.0/0`) trusts any source and relies solely on the firewall — fine if the port is firewalled to the LB, but scoping both is safer. For a GCP Application Load Balancer the source ranges are `130.211.0.0/22` and `35.191.0.0/16`, so:

```
UPSTREAM_TRUSTED_PROXIES="130.211.0.0/22 35.191.0.0/16"
```
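
Since the default trusts every source, it can help to flag a config that still contains a catch-all range before you install. The following is a small sketch of such a check; the function name `trusts_any_source` is hypothetical and the installer is not known to perform this check itself.

```shell
# trusts_any_source "CIDR CIDR ..."
# Return 0 if the space-separated list contains a catch-all range
# (0.0.0.0/0 or ::/0), meaning any client could set X-Forwarded-* headers.
trusts_any_source() {
  for cidr in $1; do
    case $cidr in
      0.0.0.0/0 | ::/0) return 0 ;;
    esac
  done
  return 1
}

# Usage, after sourcing amp-config.env (an unset value means the default applies):
#   if trusts_any_source "${UPSTREAM_TRUSTED_PROXIES:-0.0.0.0/0}"; then
#     echo "warning: trusted proxies are not scoped to the load balancer"
#   fi
```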

## DNS {#adv-dns}

For `letsencrypt` and `byoc`, point A records for every service host at the VM's public IP, plus a wildcard for the deployed-agent tier. Two wildcard records are the simplest:

```
*.amp.mycompany.com          A  <VM_PUBLIC_IP>   # covers console/api/thunder/observer/gateway/cp
*.agents.amp.mycompany.com   A  <VM_PUBLIC_IP>   # covers deployed agents (one level deeper)
```

The second record is separate because deployed-agent hostnames sit one level below the service hosts, and a `*.amp.mycompany.com` wildcard does not match `x.agents.amp.mycompany.com`. If you use a proxying DNS provider (for example Cloudflare's orange-cloud), set these records to **DNS-only** — a proxy that terminates TLS in front of the VM breaks the TLS-ALPN-01 challenge.

In `letsencrypt` mode these records must resolve to the VM **before** you run the installer — ACME issuance fails otherwise, and the installer's DNS pre-flight hard-fails with the exact records to create. (The check accepts the VM's public egress IP as well as its local interface IPs, so it works correctly on NAT'd cloud VMs.) In `upstream` mode, point DNS at the load balancer instead; the VM-side check is advisory.
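
You can check the records yourself before running the installer. The sketch below separates resolution from comparison: `resolve_pairs` looks up each host with `dig` (from the `dnsutils` package on Debian/Ubuntu, which is an assumption about your image), and `dns_mismatches` reports any host that does not resolve to the expected IP. Both function names are hypothetical and this is not the installer's pre-flight code.

```shell
# resolve_pairs HOST...
# Print "hostname ip" for each host, using the last line of dig's short answer.
resolve_pairs() {
  for host in "$@"; do
    echo "$host $(dig +short A "$host" | tail -n 1)"
  done
}

# dns_mismatches EXPECTED_IP
# Read "hostname ip" pairs on stdin, print each host whose address is missing
# or differs from EXPECTED_IP, and return 1 if any were printed.
dns_mismatches() {
  awk -v want="$1" '
    NF == 0 { next }
    $2 != want { print $1 " -> " ($2 == "" ? "(no A record)" : $2); bad = 1 }
    END { exit bad }
  '
}

# Usage: any label under the agents wildcard works as a probe name.
#   resolve_pairs console.amp.mycompany.com probe.agents.amp.mycompany.com |
#     dns_mismatches "<VM_PUBLIC_IP>"
```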

## Install {#adv-install}

First, validate the config and preview the result without touching the cluster:

```
sudo ./install-advanced.sh --config amp-config.env --dry-run
```

This loads the config, runs the certificate pre-flight and, in `letsencrypt` mode, the DNS pre-flight, and prints the derived hosts, the Helm overrides, and the rendered Caddyfile. When the output looks right, run the real install:

```
sudo ./install-advanced.sh --config amp-config.env
```

It runs in two phases: bootstrap (Docker, tools, and firewall), then the platform install and Caddy startup. Together they take 8–15 minutes. It needs `sudo` because it installs Docker, opens the firewall, and creates the cluster. On completion it prints the access URLs.

## Persistence and teardown {#adv-persistence}

Application data (PostgreSQL), issued certificates, and the k3d cluster persist across Docker/host restarts via named volumes. In `letsencrypt` mode the `amp-caddy-data` volume caches issued certificates and the ACME account, so restarts do not re-request them. To tear down completely:

```
cd agent-manager/deployments/quick-start
sudo ./uninstall.sh --delete-cluster                    # delete the k3d cluster (workloads + app data)
sudo docker rm -f amp-caddy                              # remove the Caddy front door
sudo docker volume rm amp-caddy-data amp-caddy-config    # drop the cached certs + ACME account
```

Use `sudo` (Docker and k3d run as root). Plain `./uninstall.sh` without `--delete-cluster` only removes the Helm releases and leaves the cluster running; `uninstall.sh` does not touch the Caddy container or its volumes, so remove those separately as shown.

**Changing the domain or hostnames after install requires a teardown first.** The platform install is idempotent only in the "create if missing" sense: on a re-run it skips releases that already exist. Editing `DOMAIN_BASE` (or the `HOST_*` overrides) and re-running therefore does **not** reconfigure the already-installed services. Only Caddy's front-door configuration changes, which leaves the apps advertising the old hostnames. To move an existing install to a different domain, tear it down (`sudo ./uninstall.sh --delete-cluster`, then remove `amp-caddy` and its volumes as shown above) and install again with the new config. Switching only `TLS_MODE` between `letsencrypt`, `byoc`, and `upstream` while keeping the same hostnames is fine, because that only re-renders Caddy's configuration.

## Connect an external gateway {#adv-external-gw}

This works the same as in the Simple tab: the control-plane endpoint `https://cp.<DOMAIN_BASE>` is exposed by default. Generate a registration token from **Infrastructure → Gateways** and follow the generated commands. Set `EXTERNAL_GATEWAYS=false` to drop the endpoint if you do not connect external gateways. The registration token grants a gateway your LLM-provider API keys — treat it as a secret and revoke it when a gateway is decommissioned.

## Troubleshooting {#adv-troubleshooting}

* **Config rejected before install** — the installer prints which key is missing or invalid (e.g. an unknown `TLS_MODE`, or `byoc` without `TLS_CERT_FILE`). Fix `amp-config.env` and re-run.
* **Certificate validation failed (byoc)** — the cert and key do not match, the cert is expired, or its SANs do not cover a service host or the `*.<AGENTS_BASE>` wildcard. The message names the specific problem; reissue the certificate with the required SANs (see [BYOC certificate requirements](#adv-byoc-san)).
* **DNS pre-flight failed (letsencrypt)** — one or more hostnames do not resolve to the VM. Create the A records listed under [DNS](#adv-dns) and re-run. The message names the hosts and the expected IP.
* **Certificates never issue / hosts unreachable (letsencrypt)** — open inbound `:443` in your cloud security group, and ensure it reaches the VM as **raw TCP**; a TLS-terminating load balancer in front breaks TLS-ALPN-01. If you have such a load balancer, use `upstream` mode instead. Check `docker logs amp-caddy`.
* **Changed the domain but the console still shows the old hostnames** — re-running with a new `DOMAIN_BASE` does not reconfigure existing releases. Tear down and reinstall (see [Persistence and teardown](#adv-persistence)).
* **`403 insufficient permissions` on API calls** — sign in as `amp-admin`, not Thunder's system `admin` account.
