The host agent (zwrm-agent) runs on each worker host in a multi-host cluster. It receives gRPC commands from the control plane to start, stop, and destroy VMs, pulls images from the control plane’s registry, and reports resource usage via heartbeats.

CLI flags

Flag                 Default                Description
-config              /etc/zwrm/agent.toml   Path to agent config file
-host-id             from config            Override host ID
-grpc-port           from config            Override gRPC listen port
-control-plane-url   from config            Override control plane URL
CLI flags override config file values when provided.
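The override rule can be illustrated with a small sketch. This is not the agent's own code: it only mirrors the documented flag names with Python's argparse, and the function name apply_flag_overrides and the sample config values are hypothetical.

```python
import argparse

# Values as they might be loaded from agent.toml (illustrative only).
config = {
    "host_id": "host-a",
    "grpc_port": 9090,
    "control_plane_url": "http://127.0.0.1:8080",
}


def apply_flag_overrides(config, argv):
    """Return a copy of config with any explicitly passed flags applied."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-config", default="/etc/zwrm/agent.toml")
    # A default of None lets us tell "flag not given" from a real value.
    parser.add_argument("-host-id", dest="host_id", default=None)
    parser.add_argument("-grpc-port", dest="grpc_port", type=int, default=None)
    parser.add_argument("-control-plane-url", dest="control_plane_url", default=None)
    args = parser.parse_args(argv)

    merged = dict(config)
    for key in ("host_id", "grpc_port", "control_plane_url"):
        value = getattr(args, key)
        if value is not None:  # flag was provided, so it wins over the file
            merged[key] = value
    return merged
```

Calling apply_flag_overrides(config, ["-grpc-port", "9191"]) changes only grpc_port; every key whose flag is absent keeps its config file value.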

Configuration

host_id = ""                              # Auto-generated UUID if empty
grpc_port = 9090
control_plane_url = "http://127.0.0.1:8080"
license_key = ""
region = ""                               # Defaults to "default" on control plane
datacenter = ""                           # Defaults to "default" on control plane
capacity_cpus = 4                         # Advertised CPU capacity
capacity_memory_mb = 8192                 # Advertised memory (MB)
image_cache_dir = "/var/lib/zwrm/image-cache"
image_cache_max_gb = 20                   # Max cache size before LRU eviction
If host_id is empty, a UUID is auto-generated and persisted back to the config file.
The install script auto-detects CPU count and memory from the system and writes them to agent.toml.
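The host_id fallback could look roughly like the sketch below. The helper name ensure_host_id is hypothetical, and the use of a random (version 4) UUID is an assumption, since the UUID variant is not specified here. Writing the value back to agent.toml is left to the caller.

```python
import uuid


def ensure_host_id(config):
    """Return (config, changed); generate a UUID when host_id is empty.

    Assumption: a random version 4 UUID is used. The caller is expected
    to persist the config back to disk when `changed` is True.
    """
    if config.get("host_id"):
        return config, False
    updated = dict(config)
    updated["host_id"] = str(uuid.uuid4())
    return updated, True
```

Persisting the generated ID matters because the heartbeat URL embeds host_id; an agent that picked a new ID on every restart would look like a new host each time.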

Systemd service

Create /etc/systemd/system/zwrm-agent.service:
[Unit]
Description=ZWRM Agent
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/zwrm-agent -config /etc/zwrm/agent.toml
Restart=on-failure
RestartSec=5
User=root
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
StandardOutput=journal
StandardError=journal
SyslogIdentifier=zwrm-agent

[Install]
WantedBy=multi-user.target

Then reload systemd, enable and start the service, and follow its logs:
sudo systemctl daemon-reload
sudo systemctl enable --now zwrm-agent
sudo journalctl -u zwrm-agent -f

Registration and heartbeat

Registration

On startup, the agent registers with the control plane via HTTP:
POST {control_plane_url}/v1/internal/hosts/register
The payload includes host ID, hostname, region, datacenter, IP address, gRPC port, and capacity. The agent’s outbound IP is auto-detected. Registration retries with exponential backoff (1s initial, 30s max) until success.
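As a rough illustration of the retry schedule, the sketch below generates delays that start at 1 second and are capped at 30 seconds. The doubling factor is an assumption (only the initial and maximum delays are documented), and try_register is a hypothetical stand-in for the HTTP POST.

```python
import time


def backoff_delays(initial=1.0, maximum=30.0, factor=2.0):
    """Yield retry delays: 1s, 2s, 4s, ... capped at `maximum`.

    Assumption: the delay doubles after each failure.
    """
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def register_with_retry(try_register, sleep=time.sleep):
    """Call try_register() until it returns True; return the attempt count.

    try_register is a hypothetical callable wrapping the register request.
    """
    for attempt, delay in enumerate(backoff_delays(), start=1):
        if try_register():
            return attempt
        sleep(delay)  # wait before the next attempt
```

Under the doubling assumption the first delays are 1, 2, 4, 8, and 16 seconds, after which every further retry waits 30 seconds.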

Heartbeat

Every 10 seconds, the agent sends a heartbeat:
POST {control_plane_url}/v1/internal/hosts/{host_id}/heartbeat
The heartbeat includes machine count (from /proc/*/cmdline), allocated memory (from /proc/meminfo), and allocated CPUs. If the control plane doesn’t receive a heartbeat for 30 seconds, it marks the host offline and fails all its running machines. After 3 consecutive heartbeat failures, the agent logs a warning. When it recovers, it re-registers in the background.
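The timing rules above can be sketched as two small pieces of bookkeeping, one per side. All names are hypothetical. Two details are assumptions: the host is treated as offline once the gap reaches exactly 30 seconds, and re-registration is triggered only when the agent recovers after having hit the warning threshold.

```python
HEARTBEAT_INTERVAL_S = 10   # agent sends a heartbeat this often
OFFLINE_AFTER_S = 30        # control plane marks the host offline after this gap
WARN_AFTER_FAILURES = 3     # agent logs a warning at this many failures in a row


def is_offline(last_heartbeat_at, now):
    """Control-plane side: has the host been silent for 30 seconds or more?"""
    return now - last_heartbeat_at >= OFFLINE_AFTER_S


class HeartbeatTracker:
    """Agent side: track consecutive heartbeat failures."""

    def __init__(self):
        self.consecutive_failures = 0

    def record(self, ok):
        """Record one heartbeat result; return "warn", "reregister", or None."""
        if ok:
            # Assumption: re-register only after a warned outage.
            recovered = self.consecutive_failures >= WARN_AFTER_FAILURES
            self.consecutive_failures = 0
            return "reregister" if recovered else None
        self.consecutive_failures += 1
        if self.consecutive_failures == WARN_AFTER_FAILURES:
            return "warn"
        return None
```

With a 10-second interval, three missed heartbeats in a row correspond to the same 30-second window after which the control plane marks the host offline.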

gRPC service

The agent exposes a gRPC service (agent.HostAgent) that the control plane calls to manage VMs:
RPC           Description
StartVM       Start a Firecracker VM. Pulls image from registry if image_ref is provided.
StopVM        Stop a running VM by killing its Firecracker process.
DestroyVM     Destroy a VM and clean up resources (TAP, socket).
GetVMStatus   Check if a VM process is alive.
Heartbeat     gRPC liveness probe.

Image cache

When the control plane schedules a VM on a remote host, the agent pulls the image from the control plane’s built-in registry:
  1. Check local cache ({image_cache_dir}/sha256-{hex}.ext4)
  2. If cached, touch mtime for LRU freshness and return
  3. If not cached, download from GET {control_plane_url}/v1/internal/images/{ref}
  4. Verify SHA256 hash matches the image ref
  5. Atomic rename to final cache path
  6. Trigger LRU eviction if cache exceeds image_cache_max_gb
Concurrent pulls of the same image are deduplicated.
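The pull steps above can be sketched as follows. This is a simplified illustration, not the agent's implementation: fetch is a hypothetical callable standing in for the HTTP GET, the image is held in memory rather than streamed, the cache limit is passed in bytes, and deduplication of concurrent pulls is omitted.

```python
import hashlib
import os
import tempfile


def cache_path(cache_dir, hex_digest):
    return os.path.join(cache_dir, f"sha256-{hex_digest}.ext4")


def pull_image(cache_dir, hex_digest, fetch, max_bytes):
    """Return the cached image path, downloading and verifying it if needed."""
    final = cache_path(cache_dir, hex_digest)
    if os.path.exists(final):
        os.utime(final, None)  # step 2: touch mtime for LRU freshness
        return final
    data = fetch()  # step 3: hypothetical download callable
    if hashlib.sha256(data).hexdigest() != hex_digest:  # step 4
        raise ValueError("image digest does not match image ref")
    # Write to a temp file in the same directory so the rename is atomic.
    fd, tmp = tempfile.mkstemp(dir=cache_dir, suffix=".partial")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, final)  # step 5: atomic rename
    evict_lru(cache_dir, max_bytes, keep=final)  # step 6
    return final


def evict_lru(cache_dir, max_bytes, keep=None):
    """Delete least recently used images until the cache fits max_bytes."""
    with os.scandir(cache_dir) as it:
        entries = [e for e in it if e.name.endswith(".ext4")]
    entries.sort(key=lambda e: e.stat().st_mtime)  # oldest first
    total = sum(e.stat().st_size for e in entries)
    for entry in entries:
        if total <= max_bytes:
            break
        if entry.path == keep:
            continue  # never evict the image that was just pulled
        total -= entry.stat().st_size
        os.remove(entry.path)
```

Verifying the digest before the rename means a file at the final cache path has always passed the hash check, so a cache hit in step 1 does not need to re-hash the image.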

Authentication

When license_key is configured, all communication is authenticated:
  • gRPC: SHA-256 hash of the license key sent as Bearer <hash> in metadata
  • HTTP image pull: Same hash sent as Bearer <hash> in Authorization header
  • Registration: Raw license key sent in the register request for license validation
The raw key is never sent over gRPC.
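A minimal sketch of deriving the Bearer value, under two assumptions that are not stated above: the hash is hex-encoded, and the gRPC metadata key is authorization. The function names are hypothetical.

```python
import hashlib


def bearer_token(license_key):
    """Return "Bearer <hash>" using the SHA-256 hash of the license key.

    Assumption: the hash is sent as lowercase hex.
    """
    digest = hashlib.sha256(license_key.encode("utf-8")).hexdigest()
    return f"Bearer {digest}"


def grpc_metadata(license_key):
    """Metadata for gRPC calls; empty when no license key is configured."""
    if not license_key:
        return []
    # Assumption: the metadata key is "authorization".
    return [("authorization", bearer_token(license_key))]


def http_headers(license_key):
    """Headers for the HTTP image pull; empty when no key is configured."""
    if not license_key:
        return {}
    return {"Authorization": bearer_token(license_key)}
```

Sending only the hash on these paths means a captured gRPC or image-pull request does not reveal the raw license key.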

Socket cleanup

  • On startup: Removes stale firecracker-*.socket files older than 60 seconds
  • Every 30 seconds: Same cleanup runs periodically
This cleanup only removes socket files. It does not kill Firecracker processes or delete TAP devices; those are managed by the control plane.
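The cleanup pass might look like the sketch below. The socket directory is passed in because its location is not given here, and judging age by file mtime is an assumption. The function name is hypothetical.

```python
import glob
import os
import time


def remove_stale_sockets(socket_dir, max_age_s=60, now=None):
    """Delete firecracker-*.socket files older than max_age_s seconds.

    Assumption: a socket's age is measured from its mtime.
    Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    removed = []
    pattern = os.path.join(socket_dir, "firecracker-*.socket")
    for path in sorted(glob.glob(pattern)):
        try:
            if now - os.stat(path).st_mtime > max_age_s:
                os.remove(path)
                removed.append(path)
        except FileNotFoundError:
            pass  # file vanished between listing and removal
    return removed
```

The same function can serve both the startup pass and the periodic pass that runs every 30 seconds.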

Shutdown

Triggered by SIGINT or SIGTERM:
  1. Cancel background tasks (heartbeat, cleanup)
  2. Deregister from control plane (5-second timeout)
  3. Graceful gRPC stop (30-second timeout for in-flight RPCs)
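The shutdown order can be sketched as below. The three callables are hypothetical stand-ins for the agent's real steps, and continuing after a failed deregistration is an assumption.

```python
import signal
import threading

stop_requested = threading.Event()


def install_signal_handlers():
    """Set stop_requested on SIGINT or SIGTERM (call from the main thread)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, lambda signum, frame: stop_requested.set())


def shutdown(cancel_background, deregister, stop_grpc):
    """Run the three shutdown steps in the documented order."""
    cancel_background()          # 1. stop heartbeat and cleanup tasks
    try:
        deregister(timeout=5)    # 2. tell the control plane we are leaving
    except Exception:
        # Assumption: a failed deregistration does not block the gRPC stop.
        pass
    stop_grpc(grace=30)          # 3. let in-flight RPCs finish, up to 30s
```

A main loop would call install_signal_handlers(), block on stop_requested.wait(), and then call shutdown(...) with the agent's real callbacks.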