The host agent (zwrm-agent) runs on each worker host in a multi-host cluster. It receives gRPC commands from the control plane to start, stop, and destroy VMs, pulls images from the control plane’s registry, and reports resource usage via heartbeats.
CLI flags
| Flag | Default | Description |
|---|---|---|
| -config | /etc/zwrm/agent.toml | Path to agent config file |
| -host-id | from config | Override host ID |
| -grpc-port | from config | Override gRPC listen port |
| -control-plane-url | from config | Override control plane URL |
CLI flags override config file values when provided.
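The precedence rule can be pictured with the minimal sketch below. It is illustrative only: the flag names and the default config path come from the table above, while `parse_flags`, `merge_settings`, and the dictionary keys are hypothetical names and not the agent's actual implementation.

```python
import argparse

def parse_flags(argv):
    """Parse override flags; any flag that is not given stays None."""
    parser = argparse.ArgumentParser(prog="zwrm-agent", allow_abbrev=False)
    parser.add_argument("-config", default="/etc/zwrm/agent.toml")
    parser.add_argument("-host-id", dest="host_id")
    parser.add_argument("-grpc-port", dest="grpc_port", type=int)
    parser.add_argument("-control-plane-url", dest="control_plane_url")
    return vars(parser.parse_args(argv))

def merge_settings(file_values, flags):
    """Start from config file values and apply only the flags that were set."""
    merged = dict(file_values)
    for key in ("host_id", "grpc_port", "control_plane_url"):
        if flags.get(key) is not None:
            merged[key] = flags[key]
    return merged

# Values that would normally be read from agent.toml (hard-coded for the sketch).
file_values = {
    "host_id": "host-a",
    "grpc_port": 9090,
    "control_plane_url": "http://127.0.0.1:8080",
}
flags = parse_flags(["-grpc-port", "9191"])
settings = merge_settings(file_values, flags)
print(settings)
```

Here only the gRPC port is overridden; the host ID and control plane URL keep their config file values.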
Configuration
host_id = "" # Auto-generated UUID if empty
grpc_port = 9090
control_plane_url = "http://127.0.0.1:8080"
license_key = ""
region = "" # Defaults to "default" on control plane
datacenter = "" # Defaults to "default" on control plane
capacity_cpus = 4 # Advertised CPU capacity
capacity_memory_mb = 8192 # Advertised memory (MB)
image_cache_dir = "/var/lib/zwrm/image-cache"
image_cache_max_gb = 20 # Max cache size before LRU eviction
If host_id is empty, a UUID is auto-generated and persisted back to the config file.
The install script auto-detects CPU count and memory from the system and writes them to agent.toml.
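As a rough illustration of the two behaviours above, the sketch below fills in an empty host_id with a UUID and derives capacity values from the system. The helper names are hypothetical, reading MemTotal from /proc/meminfo is an assumption about how detection might work, and writing the result back to agent.toml is left out.

```python
import os
import uuid

def ensure_host_id(config):
    """Return a copy of config with a UUID host_id filled in when it is empty."""
    if not config.get("host_id"):
        return {**config, "host_id": str(uuid.uuid4())}
    return dict(config)

def parse_mem_total_mb(meminfo_text):
    """Extract MemTotal from /proc/meminfo content and convert kB to MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found")

def detect_capacity(meminfo_path="/proc/meminfo"):
    """Assumed detection logic: CPU count from the OS, memory from MemTotal."""
    with open(meminfo_path) as f:
        memory_mb = parse_mem_total_mb(f.read())
    return {"capacity_cpus": os.cpu_count(), "capacity_memory_mb": memory_mb}

config = ensure_host_id({"host_id": "", "grpc_port": 9090})
print(config["host_id"])
```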
Systemd service
Create /etc/systemd/system/zwrm-agent.service:
[Unit]
Description=ZWRM Agent
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/zwrm-agent -config /etc/zwrm/agent.toml
Restart=on-failure
RestartSec=5
User=root
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
StandardOutput=journal
StandardError=journal
SyslogIdentifier=zwrm-agent
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now zwrm-agent
sudo journalctl -u zwrm-agent -f
Registration and heartbeat
Registration
On startup, the agent registers with the control plane via HTTP:
POST {control_plane_url}/v1/internal/hosts/register
The payload includes host ID, hostname, region, datacenter, IP address, gRPC port, and capacity. The agent’s outbound IP is auto-detected. Registration retries with exponential backoff (1s initial, 30s max) until success.
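The retry behaviour can be sketched as follows. The 1-second initial delay and 30-second cap come from the text; the doubling factor, the exception type that triggers a retry, and the function names are assumptions made for illustration.

```python
import itertools

def backoff_delays(initial=1.0, maximum=30.0, factor=2.0):
    """Yield delays that grow by `factor` and are capped at `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)

def register_with_retry(register, sleep):
    """Call register() until it succeeds; return the number of attempts."""
    attempts = 0
    for delay in backoff_delays():
        attempts += 1
        try:
            register()
            return attempts
        except OSError:
            # Assumed failure mode: a network error talking to the control plane.
            sleep(delay)

# First few delays under the assumed doubling factor.
print(list(itertools.islice(backoff_delays(), 7)))
```

In a real agent `sleep` would be `time.sleep` and `register` would send the POST request shown above; both are passed in here so the loop can be exercised without a network.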
Heartbeat
Every 10 seconds, the agent sends a heartbeat:
POST {control_plane_url}/v1/internal/hosts/{host_id}/heartbeat
The heartbeat includes machine count (from /proc/*/cmdline), allocated memory (from /proc/meminfo), and allocated CPUs.
If the control plane doesn’t receive a heartbeat from a host for 30 seconds, it marks the host offline and marks all of that host’s running machines as failed.
After 3 consecutive heartbeat failures, the agent logs a warning. Once heartbeats succeed again, it re-registers in the background.
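The timing rules can be modelled with a small sketch. The thresholds (3 failures, 30 seconds) are from the text; the class and function names are hypothetical, and the choice to re-register only after the warning threshold was reached is an assumption, since the text does not say exactly when re-registration is triggered.

```python
WARN_AFTER_FAILURES = 3      # agent side: consecutive failures before a warning
OFFLINE_AFTER_SECONDS = 30   # control plane side: silence before marking offline

class HeartbeatTracker:
    """Tracks consecutive heartbeat failures on the agent side."""

    def __init__(self):
        self.consecutive_failures = 0
        self.events = []

    def record(self, ok):
        if ok:
            # Assumption: re-register only if the warning threshold was reached.
            if self.consecutive_failures >= WARN_AFTER_FAILURES:
                self.events.append("re-register")
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures == WARN_AFTER_FAILURES:
                self.events.append("warning")

def is_offline(last_heartbeat_at, now):
    """Control plane check: no heartbeat for more than 30 seconds."""
    return now - last_heartbeat_at > OFFLINE_AFTER_SECONDS

tracker = HeartbeatTracker()
for ok in [False, False, False, False, True]:
    tracker.record(ok)
print(tracker.events)
```

With a 10-second interval, three missed heartbeats span roughly the same 30-second window the control plane uses, so the agent's warning and the host being marked offline would tend to happen at about the same time.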
gRPC service
The agent exposes a gRPC service (agent.HostAgent) that the control plane calls to manage VMs:
| RPC | Description |
|---|---|
| StartVM | Start a Firecracker VM. Pulls image from registry if image_ref is provided. |
| StopVM | Stop a running VM by killing its Firecracker process. |
| DestroyVM | Destroy a VM and clean up resources (TAP, socket). |
| GetVMStatus | Check if a VM process is alive. |
| Heartbeat | gRPC liveness probe. |
Image cache
When the control plane schedules a VM on a remote host, the agent pulls the image from the control plane’s built-in registry:
- Check local cache ({image_cache_dir}/sha256-{hex}.ext4)
- If cached, touch mtime for LRU freshness and return
- If not cached, download from GET {control_plane_url}/v1/internal/images/{ref}
- Verify SHA256 hash matches the image ref
- Atomic rename to final cache path
- Trigger LRU eviction if cache exceeds image_cache_max_gb
Concurrent pulls of the same image are deduplicated.
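The cache steps above can be sketched roughly as below. This is not the agent's code: the ref format `sha256:<hex>`, the function names, and the use of file mtime as the LRU signal are assumptions. The HTTP download and the deduplication of concurrent pulls are omitted, so the image bytes are passed in directly.

```python
import hashlib
import os
import tempfile

def cache_path(cache_dir, image_ref):
    # Assumed ref format "sha256:<hex>", mapped to sha256-<hex>.ext4.
    algo, _, hex_digest = image_ref.partition(":")
    return os.path.join(cache_dir, f"{algo}-{hex_digest}.ext4")

def store_image(cache_dir, image_ref, data):
    """Verify data against the ref, then atomically move it into the cache."""
    final = cache_path(cache_dir, image_ref)
    if os.path.exists(final):
        os.utime(final)  # cache hit: touch mtime so LRU treats it as fresh
        return final
    expected = image_ref.partition(":")[2]
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("image hash does not match ref")
    fd, tmp = tempfile.mkstemp(dir=cache_dir, suffix=".partial")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, final)  # atomic when tmp and final share a filesystem
    return final

def evict_lru(cache_dir, max_bytes):
    """Delete the least recently touched images until the cache fits."""
    entries = [e for e in os.scandir(cache_dir) if e.name.endswith(".ext4")]
    entries.sort(key=lambda e: e.stat().st_mtime)
    total = sum(e.stat().st_size for e in entries)
    removed = []
    for entry in entries:
        if total <= max_bytes:
            break
        total -= entry.stat().st_size
        os.remove(entry.path)
        removed.append(entry.name)
    return removed
```

Writing to a temporary file in the cache directory and renaming it means a reader never sees a partially written image under its final name.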
Authentication
When license_key is configured, all communication is authenticated:
- gRPC: SHA-256 hash of the license key sent as Bearer <hash> in metadata
- HTTP image pull: Same hash sent as Bearer <hash> in Authorization header
- Registration: Raw license key sent in the register request for license validation
The raw key is never sent over gRPC.
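A minimal sketch of how such a token could be derived and checked is shown below. The hex encoding of the hash, the UTF-8 encoding of the key, and the function names are assumptions; the text only states that a SHA-256 hash is sent as a Bearer value.

```python
import hashlib
import hmac

def bearer_token(license_key):
    """Build the Bearer value from the SHA-256 hash of the license key."""
    digest = hashlib.sha256(license_key.encode("utf-8")).hexdigest()
    return f"Bearer {digest}"

def image_pull_headers(license_key):
    """HTTP headers for an image pull; empty when no license key is set."""
    if not license_key:
        return {}
    return {"Authorization": bearer_token(license_key)}

def verify_bearer(received, license_key):
    """Receiver-side check using a constant-time comparison."""
    return hmac.compare_digest(received, bearer_token(license_key))

headers = image_pull_headers("example-license-key")
print(headers["Authorization"][:14] + "...")
```

Sending only the hash over gRPC and image pulls means the raw key appears solely in the registration request.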
Socket cleanup
- On startup: Removes stale firecracker-*.socket files older than 60 seconds
- Every 30 seconds: Same cleanup runs periodically
During socket cleanup, the agent does not kill Firecracker processes or delete TAP devices; those are managed by the control plane.
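A cleanup pass of this kind might look like the sketch below. The 60-second threshold and the firecracker-*.socket pattern are from the text; the socket directory is not specified in this document, so it is a parameter here, and judging staleness by file mtime alone is an assumption (the real agent may apply further checks).

```python
import glob
import os
import time

STALE_AFTER_SECONDS = 60

def remove_stale_sockets(socket_dir, now=None):
    """Remove firecracker-*.socket files whose mtime is older than 60 seconds."""
    now = time.time() if now is None else now
    removed = []
    for path in glob.glob(os.path.join(socket_dir, "firecracker-*.socket")):
        try:
            if now - os.stat(path).st_mtime > STALE_AFTER_SECONDS:
                os.remove(path)
                removed.append(path)
        except FileNotFoundError:
            # The file disappeared between listing and removal; nothing to do.
            continue
    return sorted(removed)
```

The same function could serve both the startup pass and the periodic pass, for example by calling it once and then again from a loop that sleeps 30 seconds between runs.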
Shutdown
Triggered by SIGINT or SIGTERM:
- Cancel background tasks (heartbeat, cleanup)
- Deregister from control plane (5-second timeout)
- Graceful gRPC stop (30-second timeout for in-flight RPCs)
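The shutdown sequence can be sketched as follows, assuming the steps run in the order listed. The timeouts are from the list above; the function names and the callback-based structure are hypothetical, and wiring the function to SIGINT and SIGTERM is left out.

```python
import threading

DEREGISTER_TIMEOUT_SECONDS = 5
GRPC_GRACE_SECONDS = 30

def run_with_timeout(fn, timeout):
    """Run fn in a daemon thread; return True if it finished within timeout."""
    worker = threading.Thread(target=fn, daemon=True)
    worker.start()
    worker.join(timeout)
    return not worker.is_alive()

def shutdown(cancel_background, deregister, stop_grpc):
    """Run the three shutdown steps in order and report what happened."""
    steps = []
    cancel_background()  # stop the heartbeat and cleanup loops
    steps.append("cancel")
    finished = run_with_timeout(deregister, DEREGISTER_TIMEOUT_SECONDS)
    steps.append("deregister" if finished else "deregister-timeout")
    stop_grpc(GRPC_GRACE_SECONDS)  # let in-flight RPCs finish within the grace period
    steps.append("grpc-stop")
    return steps

grace_values = []
print(shutdown(lambda: None, lambda: None, grace_values.append))
```

Bounding the deregistration call keeps an unreachable control plane from blocking shutdown indefinitely.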